A tokenizer converts text into tokens, a process known as [[tokenization]].
A tokenizer contains a vocabulary: the exhaustive list of all tokens the system can produce. A tokenizer focused on English text will have a very different vocabulary from a tokenizer focused on code. Special tokens may be reserved to mark the beginning of the text string, the start of a sentence, the end of a sentence, and so on. Inside the model these special tokens require no special treatment during inference or fine-tuning; they are embedded and processed the same as every other token. A token that represents the beginning of a word typically absorbs the preceding space, so the word-initial form is a distinct token from the same characters occurring mid-word. Tokens are also typically case sensitive.
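For instance (a minimal sketch using the GPT-2 tokenizer from `transformers`, chosen purely as an illustration):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # illustrative choice of model

# A word-initial token absorbs its leading space (GPT-2 renders it as 'Ġ'),
# so 'world' after a space is a different token from 'world' mid-word.
print(tok.tokenize("Hello world"))  # ['Hello', 'Ġworld']

# Tokens are case sensitive: 'Hello' and 'hello' are distinct vocabulary entries.
print(tok.tokenize("hello world"))  # ['hello', 'Ġworld']

# The special tokens this vocabulary reserves (GPT-2 defines only '<|endoftext|>').
print(tok.special_tokens_map)
```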
The tokenizer decomposes a text string into a sequence of tokens drawn from this vocabulary; each token is identified by a unique integer ID.
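Continuing the illustrative GPT-2 sketch, the mapping between text, tokens, and IDs is exposed by `encode`, `convert_ids_to_tokens`, and `decode`:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # illustrative choice of model

ids = tok.encode("Hello world")        # text -> token IDs
print(ids)                             # [15496, 995]
print(tok.convert_ids_to_tokens(ids))  # ['Hello', 'Ġworld']
print(tok.decode(ids))                 # back to 'Hello world'
print(len(tok))                        # vocabulary size (50257 for GPT-2)
```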
In [[HuggingFace]], use the `AutoTokenizer` class from the `transformers` library:
```python
from transformers import AutoTokenizer

# '<model>' is a placeholder for a model name or path on the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained('<model>', trust_remote_code=True)

# Models like GPT-2 define no pad token; reuse the end-of-sequence token
tokenizer.pad_token = tokenizer.eos_token
```
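Continuing from the block above (with a concrete model substituted for the `<model>` placeholder), a quick usage sketch of batched encoding, which is where the pad token matters:

```python
batch = tokenizer(
    ["a short text", "a somewhat longer piece of text"],
    padding=True,          # pad the shorter sequence up to the batch maximum
    return_tensors="pt",   # PyTorch tensors (requires torch to be installed)
)
print(batch["input_ids"].shape)  # (2, longest_sequence_length)
print(batch["attention_mask"])   # 0s mark the padded positions
```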