Given a training corpus, tokenization algorithms identify the base words and subwords (in subword tokenization) that make up a [[vocabulary]]. Tokens serve as a convenient approximation of [[morphemes]]. Subword tokenization represents an open vocabulary with a fixed token inventory while remaining robust to out-of-vocabulary (OOV) words such as proper nouns, misspellings, and rare words, since an unseen word can still be segmented into known subword pieces (see the sketch below).
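As a quick illustration of that robustness, here is a toy greedy longest-match segmenter over a hand-picked vocabulary. Both the vocabulary and the test words are assumptions chosen for the example, not the output of a real learner.

```python
# Toy greedy longest-match segmentation over an assumed, hand-picked vocabulary.
VOCAB = {"un", "token", "iz", "ed", "ing", "ation", "s", "er"}

def segment(word, vocab):
    """Greedily take the longest known prefix; fall back to single characters."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):      # try the longest match first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:                                   # unknown character: emit it as-is
            pieces.append(word[i])
            i += 1
    return pieces

print(segment("untokenized", VOCAB))    # ['un', 'token', 'iz', 'ed']
print(segment("tokenizations", VOCAB))  # ['token', 'iz', 'ation', 's']
```

Neither test word is in the vocabulary as a whole, yet both decompose into known subword pieces.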
There are three commonly used algorithms, plus a widely used library that implements two of them:
- [[byte-pair encoding]] (Gage, 1994; Sennrich et al., 2016)
- WordPiece (Schuster and Nakajima, 2012)
- Unigram LM (Kudo, 2018)
- SentencePiece (Kudo and Richardson, 2018), a library that provides byte-pair encoding and Unigram LM
Subword tokenization algorithms consist of three components (sketched in code after this list):
- a **learner** that takes a raw training corpus and induces a vocabulary consisting of N tokens, where N is specified by the user,
- a **tokenizer** (or encoder) that takes a pre-segmented input and tokenizes it according to a given vocabulary, and
- a **decoder** that takes a list of integer indices and generates a surface form.
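The sketch below shows what these three pieces can look like for byte-pair encoding. The training corpus, the merge count, and the end-of-word marker `"_"` are assumptions chosen for the example; real implementations (e.g., SentencePiece or Hugging Face tokenizers) add pre-tokenization, byte fallback, and special tokens.

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learner: induce an ordered list of merge rules from a raw corpus."""
    # Represent each whitespace-split word as a tuple of symbols, with "_"
    # marking the end of a word so spaces can be recovered at decode time.
    words = Counter(tuple(w) + ("_",) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)          # most frequent pair -> new merge
        merges.append(best)
        # Replace every occurrence of the pair with the merged symbol.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges

def tokenize(word, merges):
    """Tokenizer: replay the learned merges, in order, on a new word."""
    symbols = list(word) + ["_"]
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

corpus = "low low low lower lower newest newest newest newest widest"
merges = learn_bpe(corpus, num_merges=10)

# Assign integer ids: every base character plus one new token per merge rule.
vocab = sorted({c for c in corpus if c != " "} | {"_"} | {a + b for a, b in merges})
tok2id = {t: i for i, t in enumerate(vocab)}
id2tok = {i: t for t, i in tok2id.items()}

def encode(text, merges):
    """Encoder: tokenize each word and look up integer ids (assumes no unseen characters)."""
    return [tok2id[t] for w in text.split() for t in tokenize(w, merges)]

def decode(ids):
    """Decoder: map ids back to tokens and restore spaces from the "_" marker."""
    return "".join(id2tok[i] for i in ids).replace("_", " ").strip()

ids = encode("newest lowest", merges)   # "lowest" never appears in the training corpus
print(ids, "->", decode(ids))
```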
Use [Tiktokenizer](https://tiktokenizer.vercel.app) to compare tokenization across different models.
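For a programmatic comparison of OpenAI vocabularies, the snippet below does roughly what Tiktokenizer does in the browser. It assumes the `tiktoken` package is installed and that both named encodings ship with the installed version.

```python
import tiktoken  # pip install tiktoken

text = "Subword tokenizers handle OOV words like Supercalifragilisticexpialidocious."
for name in ["r50k_base", "cl100k_base"]:     # GPT-3-era vs. GPT-4-era vocabularies
    enc = tiktoken.get_encoding(name)
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]   # decode each id to see its subword piece
    print(f"{name}: {len(ids)} tokens {pieces}")
```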