A tokenizer converts text into tokens, a process known as [[tokenization]].
A tokenizer contains a vocabulary: the exhaustive list of all tokens the system can produce. A tokenizer focused on English text will have a very different vocabulary from a tokenizer focused on code. Special tokens may be reserved to mark the beginning of the text string, the start of a sentence, the end of a sentence, and so on. Inside the model these special tokens require no special treatment during inference or fine-tuning; they are embedded and processed the same as every other token. A token that represents the beginning of a word typically absorbs the preceding space, so the word-initial form is a distinct token from the same characters occurring mid-word. Tokens are also typically case sensitive.
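For instance (a minimal sketch using the GPT-2 tokenizer from `transformers`, chosen purely as an illustration):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # illustrative choice of model

# A word-initial token absorbs its leading space (GPT-2 renders it as 'Ġ'),
# so 'world' after a space is a different token from 'world' mid-word.
print(tok.tokenize("Hello world"))  # ['Hello', 'Ġworld']

# Tokens are case sensitive: 'Hello' and 'hello' are distinct vocabulary entries.
print(tok.tokenize("hello world"))  # ['hello', 'Ġworld']

# The special tokens this vocabulary reserves (GPT-2 defines only '<|endoftext|>').
print(tok.special_tokens_map)
```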
The tokenizer decomposes a text string into a sequence of tokens drawn from this vocabulary; each token is identified by a unique integer ID.
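Continuing the illustrative GPT-2 sketch, the mapping between text, tokens, and IDs is exposed by `encode`, `convert_ids_to_tokens`, and `decode`:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # illustrative choice of model

ids = tok.encode("Hello world")        # text -> token IDs
print(ids)                             # [15496, 995]
print(tok.convert_ids_to_tokens(ids))  # ['Hello', 'Ġworld']
print(tok.decode(ids))                 # back to 'Hello world'
print(len(tok))                        # vocabulary size (50257 for GPT-2)
```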
In [[HuggingFace]], use the `AutoTokenizer` class from the `transformers` library:
```python
from transformers import AutoTokenizer

# '<model>' is a placeholder for a model name or path on the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained('<model>', trust_remote_code=True)

# Models like GPT-2 define no pad token; reuse the end-of-sequence token
tokenizer.pad_token = tokenizer.eos_token
```
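Continuing from the block above (with a concrete model substituted for the `<model>` placeholder), a quick usage sketch of batched encoding, which is where the pad token matters:

```python
batch = tokenizer(
    ["a short text", "a somewhat longer piece of text"],
    padding=True,          # pad the shorter sequence up to the batch maximum
    return_tensors="pt",   # PyTorch tensors (requires torch to be installed)
)
print(batch["input_ids"].shape)  # (2, longest_sequence_length)
print(batch["attention_mask"])   # 0s mark the padded positions
```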