Byte-pair encoding (BPE) is the most popular [[tokenization]] algorithm, used by OpenAI, Llama, and Mistral models. The training pseudocode is:

1. Let the initial vocabulary be the set of all individual characters (e.g., `A-Za-z0-9`).
2. For $k$ steps:
    1. Choose the most frequently occurring adjacent pair of vocabulary items (e.g., `t` and `h`).
    2. Add a new merged symbol `th` to the vocabulary.
    3. Replace every adjacent occurrence of `t` and `h` in the corpus with `th`.

*Note the similarity to [[Huffman codes]], but in reverse.*

There are no out-of-vocabulary values, provided all characters were included in the initial vocabulary. Depending on the [[encoding]], you may still encounter characters that were not in the initial vocabulary; a common approach is to start with a version of [[Unicode]] as the initial character set and fall back to byte encoding for unknown characters.

Practical implementations require some pre-processing, including sentence segmentation and some use of regular expressions to keep apart pairs that should not be merged (e.g., letters and punctuation). During the encoding step, BPE uses the learned list of merges to guide tokenization, applying the merges in the order they were learned. A sketch of both steps appears below.
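
The following is a minimal sketch of both the training loop and the encoding step, assuming a toy corpus of already-split words and no pre-tokenization, byte fallback, or explicit tie-breaking. The function names (`train_bpe`, `merge_pair`, `encode`) are illustrative, not from any particular library.

```python
from collections import Counter


def train_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn an ordered list of merges from a toy corpus of words."""
    # Represent each word as a sequence of single-character symbols.
    words = [list(word) for word in corpus]
    merges: list[tuple[str, str]] = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols across the corpus.
        pair_counts: Counter[tuple[str, str]] = Counter()
        for symbols in words:
            pair_counts.update(zip(symbols, symbols[1:]))
        if not pair_counts:
            break
        # Choose the most frequently occurring pair and record the merge.
        best = pair_counts.most_common(1)[0][0]
        merges.append(best)
        # Replace every adjacent occurrence of the pair with the merged symbol.
        words = [merge_pair(symbols, best) for symbols in words]
    return merges


def merge_pair(symbols: list[str], pair: tuple[str, str]) -> list[str]:
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, out, i = pair[0] + pair[1], [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def encode(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Tokenize a word by applying the merges in the order they were learned."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


merges = train_bpe(["the", "there", "then", "this"], num_merges=3)
print(merges)                    # the first learned merge on this toy corpus is ('t', 'h')
print(encode("them", merges))    # ['the', 'm']
```

Production tokenizers differ mainly in the details left out here: regex pre-tokenization so merges never cross letter/punctuation boundaries, byte-level initial vocabularies, and deterministic tie-breaking when pair counts are equal.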