Given a training corpus, tokenization algorithms identify the base words and subwords (in subword tokenization) that make up a [[vocabulary]]. Tokens serve as a convenient approximation of [[morphemes]]. Subword tokenization represents an open vocabulary with a fixed token inventory while remaining robust to out-of-vocabulary (OOV) words such as proper nouns, misspellings, and rare words, since an unseen word can still be segmented into known subword pieces (see the sketch below).
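As a quick illustration of that robustness, here is a toy greedy longest-match segmenter over a hand-picked vocabulary. Both the vocabulary and the test words are assumptions chosen for the example, not the output of a real learner.

```python
# Toy greedy longest-match segmentation over an assumed, hand-picked vocabulary.
VOCAB = {"un", "token", "iz", "ed", "ing", "ation", "s", "er"}

def segment(word, vocab):
    """Greedily take the longest known prefix; fall back to single characters."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):      # try the longest match first
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:                                   # unknown character: emit it as-is
            pieces.append(word[i])
            i += 1
    return pieces

print(segment("untokenized", VOCAB))    # ['un', 'token', 'iz', 'ed']
print(segment("tokenizations", VOCAB))  # ['token', 'iz', 'ation', 's']
```

Neither test word is in the vocabulary as a whole, yet both decompose into known subword pieces.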
There are three commonly used algorithms, plus a widely used library that implements two of them:
- [[byte-pair encoding]] (Gage, 1994; Sennrich et al., 2016)
- WordPiece (Schuster and Nakajima, 2012)
- Unigram LM (Kudo, 2018)
- SentencePiece (Kudo and Richardson, 2018), a library that provides byte-pair encoding and Unigram LM
Subword tokenization algorithms consist of three components (sketched in code after this list):
- a **learner** that takes a raw training corpus and induces a vocabulary consisting of N tokens, where N is specified by the user,
- a **tokenizer** (or encoder) that takes a pre-segmented input and tokenizes it according to a given vocabulary, and
- a **decoder** that takes a list of integer indices and generates a surface form.
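The sketch below shows what these three pieces can look like for byte-pair encoding. The training corpus, the merge count, and the end-of-word marker `"_"` are assumptions chosen for the example; real implementations (e.g., SentencePiece or Hugging Face tokenizers) add pre-tokenization, byte fallback, and special tokens.

```python
from collections import Counter

def learn_bpe(corpus, num_merges):
    """Learner: induce an ordered list of merge rules from a raw corpus."""
    # Represent each whitespace-split word as a tuple of symbols, with "_"
    # marking the end of a word so spaces can be recovered at decode time.
    words = Counter(tuple(w) + ("_",) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)          # most frequent pair -> new merge
        merges.append(best)
        # Replace every occurrence of the pair with the merged symbol.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges

def tokenize(word, merges):
    """Tokenizer: replay the learned merges, in order, on a new word."""
    symbols = list(word) + ["_"]
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

corpus = "low low low lower lower newest newest newest newest widest"
merges = learn_bpe(corpus, num_merges=10)

# Assign integer ids: every base character plus one new token per merge rule.
vocab = sorted({c for c in corpus if c != " "} | {"_"} | {a + b for a, b in merges})
tok2id = {t: i for i, t in enumerate(vocab)}
id2tok = {i: t for t, i in tok2id.items()}

def encode(text, merges):
    """Encoder: tokenize each word and look up integer ids (assumes no unseen characters)."""
    return [tok2id[t] for w in text.split() for t in tokenize(w, merges)]

def decode(ids):
    """Decoder: map ids back to tokens and restore spaces from the "_" marker."""
    return "".join(id2tok[i] for i in ids).replace("_", " ").strip()

ids = encode("newest lowest", merges)   # "lowest" never appears in the training corpus
print(ids, "->", decode(ids))
```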
Use [Tiktokenizer](https://tiktokenizer.vercel.app) to compare tokenization across different models.
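For a programmatic comparison of OpenAI vocabularies, the snippet below does roughly what Tiktokenizer does in the browser. It assumes the `tiktoken` package is installed and that both named encodings ship with the installed version.

```python
import tiktoken  # pip install tiktoken

text = "Subword tokenizers handle OOV words like Supercalifragilisticexpialidocious."
for name in ["r50k_base", "cl100k_base"]:     # GPT-3-era vs. GPT-4-era vocabularies
    enc = tiktoken.get_encoding(name)
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]   # decode each id to see its subword piece
    print(f"{name}: {len(ids)} tokens {pieces}")
```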