A vocabulary is the list of words known to the system. Anything not in the vocabulary is called **out-of-vocabulary (OOV)**.
Smaller vocabularies have lower memory requirements but tend to produce longer input sequences, since words must be tokenized into smaller units. The trend in [[LLM]]s is towards large vocabularies, ranging from roughly 100,000 tokens (GPT-4) to 256,000 (Gemma).
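The trade-off can be seen with a toy comparison (not a real LLM tokenizer): a character-level scheme needs only a tiny vocabulary but yields much longer sequences than a word-level scheme over the same text.

```python
# Toy illustration of the vocabulary-size / sequence-length trade-off.
text = "tokenization splits text into units"

# Word-level "tokenizer": large vocabulary (one entry per word type),
# but short sequences.
word_tokens = text.split()

# Character-level "tokenizer": tiny vocabulary (one entry per character),
# but long sequences.
char_tokens = list(text)

print(len(word_tokens))  # 5 tokens
print(len(char_tokens))  # 35 tokens
```

Subword tokenizers such as BPE sit between these extremes, which is why real vocabulary sizes land in the tens or hundreds of thousands.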
A vocabulary consists of **word types**, the distinct words in a corpus. **Word instances** are the individual occurrences of each word type. Depending on the task, the words "the" and "The", for instance, may count as the same word type or as two different ones. For example, the **Google n-gram corpus** has 1 trillion word instances but only 13 million word types. For comparison, the Oxford English Dictionary contains approximately 615,000 entries.
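The type/instance distinction, and the effect of case folding on it, can be sketched in a few lines over a hypothetical mini-corpus:

```python
from collections import Counter

# Hypothetical mini-corpus to illustrate types vs. instances.
corpus = "The cat saw the dog and the dog saw the cat"

tokens = corpus.split()                          # word instances
types_cased = set(tokens)                        # "The" != "the"
types_uncased = {t.lower() for t in tokens}      # folded into one type

print(len(tokens))         # 11 word instances
print(len(types_cased))    # 6 word types ("The" and "the" are distinct)
print(len(types_uncased))  # 5 word types after lowercasing

# Frequency of each (lowercased) type:
counts = Counter(t.lower() for t in tokens)
print(counts["the"])       # 4
```

Whether to fold case is a task-level decision: search engines usually do, while named-entity recognition usually does not.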