## Foundational Large Language Models & Text Generation
> [!Cite]-
> Barektain, Mohammadamin, Anant Nawalgaria, Daniel J Mankowitz, Majd Al Merey, Yaniv Leviathan, Massimo Mascaro, Matan Kalman, Elena Buchatskaya, Aliaksei Severyn, and Antonio Gulli. “Foundational Large Language Models & Text Generation,” n.d.
>
> [online](http://zotero.org/users/local/kycSZ2wR/items/I74QB8FQ) [local](zotero://select/library/items/I74QB8FQ) [pdf](file://C:\Users\erikt\Zotero\storage\K2WUJGJH\Barektain%20et%20al.%20-%20Foundational%20Large%20Language%20Models%20&%20Text%20Generati.pdf)
## Notes
%% begin notes %%
%% end notes %%
%% begin annotations %%
### Imported: 2024-11-13 9:06 am
An LLM is an advanced artificial intelligence system that specializes in processing, understanding, and generating human-like text. These systems are typically implemented as a deep neural network and are trained on massive amounts of text data.
This whitepaper dives into the timeline of the various architectures and approaches leading up to large language models, as well as the architectures in use at the time of publication. It also discusses fine-tuning techniques to customize an LLM to a certain domain or task, methods to make training more efficient, and methods to accelerate inference. These are followed by various applications and code examples.
LLMs achieve an impressive performance boost over the previous state of the art across a variety of complex tasks that require question answering or complex reasoning, making many new applications feasible. These include language translation, code generation and completion, text generation, text classification, and question-answering, to name a few.
The transformer architecture was developed at Google in 2017 for use in a translation model.1 It’s a sequence-to-sequence model capable of converting sequences from one domain into sequences in another domain. For example, translating French sentences to English sentences. The original transformer architecture consists of two parts: an encoder and a decoder. The encoder converts the input text (e.g., a French sentence) into a representation, which is then passed to the decoder. The decoder uses this representation to generate the output text (e.g., an English translation) autoregressively. Notably, the size of the output of the transformer encoder is linear in the size of its input.
- Normalization (optional): Standardizes text by removing redundant whitespace, accents, etc.
- Tokenization: Breaks the sentence into words or subwords and maps them to integer token IDs from a vocabulary.
- Embedding: Converts each token ID to its corresponding high-dimensional vector, typically using a lookup table. These can be learned during the training process.
- Positional encoding: Adds information about the position of each token in the sequence to help the transformer understand word order.
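As a rough illustration of this input pipeline, here is a minimal NumPy sketch using a toy whitespace tokenizer, a random embedding table, and sinusoidal positional encodings; the vocabulary, dimensions, and function names are illustrative assumptions, not taken from the whitepaper:

```python
import numpy as np

# Toy vocabulary: real models use subword tokenizers with tens of thousands of entries.
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}
d_model = 8  # embedding dimension (tiny, for illustration)

def tokenize(text):
    # Normalization + tokenization: lowercase, split on whitespace, map to IDs.
    return [vocab.get(w, vocab["<unk>"]) for w in text.lower().split()]

# Embedding table: one vector per token ID (random stand-in for learned weights).
embedding_table = np.random.randn(len(vocab), d_model)

def positional_encoding(seq_len, d_model):
    # Sinusoidal positional encoding, as in the original transformer paper.
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

token_ids = tokenize("The cat sat on the mat")        # [1, 2, 3, 4, 1, 5]
x = embedding_table[token_ids]                        # (seq_len, d_model)
x = x + positional_encoding(len(token_ids), d_model)  # inject word-order information
```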
The use of multi-head attention improves the model’s ability to handle complex language patterns and long-range dependencies. This is crucial for tasks that require a nuanced understanding of language structure and content, such as machine translation, text summarization, and question-answering. The mechanism enables the transformer to consider multiple interpretations and representations of the input, which enhances its performance on these tasks.
The majority of recent LLMs adopt a decoder-only variant of the transformer architecture. This approach forgoes the traditional encoder-decoder separation, focusing instead on directly generating the output sequence from the input. The input sequence undergoes a similar process of embedding and positional encoding before being fed into the decoder. The decoder then uses masked self-attention to generate predictions for each subsequent token based on the previously generated tokens. This streamlined approach simplifies the architecture for tasks where encoding and decoding can be effectively merged.
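The masked (causal) self-attention at the heart of decoder-only models can be sketched as follows; this is a single-head simplification (real models run many such heads in parallel and concatenate their outputs), and the weight names are illustrative:

```python
import numpy as np

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head masked (causal) self-attention. Multi-head attention runs
    several of these in parallel on smaller projections."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v              # (seq_len, d_head) each
    scores = q @ k.T / np.sqrt(k.shape[-1])          # scaled dot-product
    # Causal mask: token i may only attend to tokens 0..i (no peeking ahead).
    mask = np.triu(np.ones_like(scores), 1).astype(bool)
    scores = np.where(mask, -1e9, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax per row
    return weights @ v                                # (seq_len, d_head)
```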
Decoder-only models are typically pre-trained on the language modeling task (e.g., see endnotes 12, 13). The target sequence for the decoder is simply a shifted version of the input sequence. Given a training sequence like ‘the cat sat on the mat’, various input/target pairs can be generated for the model. For example, the input “the cat sat on” should predict the target “the”, and subsequently the input “the cat sat on the” should predict the target “mat”.
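A tiny sketch of how these input/target pairs fall out of a single training sequence (in practice the shift is applied to whole batches of token IDs rather than word strings):

```python
# Next-token prediction: each prefix of the sequence predicts the token that follows it.
tokens = "the cat sat on the mat".split()
pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
# [(['the'], 'cat'), (['the', 'cat'], 'sat'), ..., (['the', 'cat', 'sat', 'on', 'the'], 'mat')]

# Equivalently, for a whole sequence at once: input = tokens[:-1], target = tokens[1:].
```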
Encoder-only models (like BERT)14 are often pre-trained by corrupting the input sequence in some way and having the model try to reconstruct it. One such approach is masked language modeling (MLM).14 In our example, the input sequence could be “The MASK sat on the mat” and the target sequence would be the original sentence.
Encoder-decoder models (like the original transformer) are trained on sequence-to-sequence supervised tasks such as translation (input sequence “Le chat est assis sur le tapis” and target “The cat sat on the mat”), question-answering (where the input sequence is a question and the target sequence is the corresponding answer), and summarization (where the input sequence is a full article and the target sequence is its corresponding summary). These models could also be trained in an unsupervised way by converting other tasks into sequence-to-sequence format. For example, when training on Wikipedia data, the input sequence might be the first part of an article, and the target sequence comprises the remainder of the article.
An additional factor to consider during training is the ‘context length’. This refers to the number of previous tokens the model can ‘remember’ and use to predict the next token in the sequence. Longer context lengths allow the model to capture more complex relationships and dependencies within the text, potentially leading to better performance. However, longer contexts also require more computational resources and memory, which can slow down training and inference. Choosing an appropriate context length involves balancing these trade-offs based on the specific task and available resources.
GPT-1 (Generative pre-trained transformer version 1)15 was a decoder-only model developed by OpenAI in 2018. It was trained on the BooksCorpus dataset (containing several billion words) and is able to generate text, translate languages, write different kinds of creative content, and answer questions in an informative way.
BERT,14 which stands for Bidirectional Encoder Representations from Transformers, distinguishes itself from traditional encoder-decoder transformer models by being an encoder-only architecture.
BERT captures intricate context dependencies from both the left and right of a word, and it can discern the relationship between pairs of sentences. Such capabilities make BERT especially good at tasks that require natural language understanding, such as question-answering, sentiment analysis, and natural language inference, among others. Since this is an encoder-only model, BERT cannot generate text.
GPT-2,12 the successor to GPT-1, was released in 2019 by OpenAI. The main innovation of GPT-2 was a direct scale-up, with a tenfold increase in both its parameter count and the size of its training dataset.
GPT-2 was trained on a large (40GB) and diverse dataset called WebText, which consists of 45 million webpages linked from Reddit posts with a karma score of at least three. Karma is a rating metric used on Reddit; requiring a score of at least three served as a rough filter for posts of reasonable quality.
GPT-2’s most significant achievement was its ability to perform zero-shot learning on a variety of tasks. Zero-shot task transfer is the ability of a model to generalize to a new task without being trained on it, which requires the model to understand the task based on the given instruction.
The most noticeable difference is the sheer size of GPT-3, boasting a whopping 175 billion parameters, compared to GPT-2’s largest model which had 1.5 billion parameters.
GPT-3.5 models, including GPT-3.5 Turbo, improve over GPT-3: they are capable of understanding and generating code, have been optimized for dialogue, and support context windows of up to 16,385 tokens with outputs of up to 4,096 tokens.
GPT-4 extends GPT-3.5 as a large multimodal model, accepting text or images as input and producing text outputs.19 This model has broader general knowledge and advanced reasoning capabilities. It can receive context windows of up to 128,000 tokens and has a maximum output of 4,096 tokens. GPT-4 demonstrates remarkable versatility by solving complex tasks across diverse fields like mathematics, coding, vision, medicine, law, and psychology, all without specialized instructions. Its performance often matches or even exceeds human capabilities and significantly outperforms earlier models like GPT-3.5.
Google’s LaMDA,20 which stands for ‘Language Model for Dialogue Applications’, is another contribution to the arena of large-scale language models, designed primarily to engage in open-ended conversations.
GPT models shine in their ability to produce coherent long-form content and perform various tasks with minimal prompting, whereas LaMDA emphasizes the flow and progression of dialogue, striving to mimic the unpredictability and richness of human conversations.
Gopher22 is a 280 billion parameter language model based on the decoder-only transformer architecture, developed by DeepMind in 2021. It can generate text, translate languages, write different kinds of creative content, and answer questions in an informative way.
The researchers curated a high-quality text dataset called MassiveText, which contains over 10 terabytes of data and 2.45B documents from web pages, books, news articles, and code (GitHub). They only trained on 300B tokens, which is 12% of the dataset. Importantly, they improved the quality of the data by filtering it, such as by removing duplicate text and deduplicating similar documents. This significantly improved the model’s performance on downstream tasks.
GLaM (Generalist Language Model)23 was the first sparsely-activated mixture-of-experts language model. Mixture-of-experts based models are much more computationally efficient given their parameter count. This is achieved by only activating a subset of their parameters (i.e. experts) for each input token. GLaM consists of 1.2 trillion parameters but uses only ⅓ of the energy used to train GPT-3 and half of the FLOPs for inference while achieving better overall performance compared to GPT-3.
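The sparse-activation idea can be sketched as a top-2 router over a set of expert networks; the router, expert count, and gating details below are illustrative assumptions rather than GLaM's exact design:

```python
import numpy as np

def moe_layer(x, experts, router_w, top_k=2):
    """Sparsely-activated mixture-of-experts: each token is routed to only
    top_k experts, so most expert parameters stay idle for any given token."""
    logits = x @ router_w                        # (num_experts,) router scores
    top = np.argsort(logits)[-top_k:]            # indices of the chosen experts
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over chosen
    return sum(g * experts[i](x) for g, i in zip(gates, top))

# Example: 8 tiny "experts", each a random linear map; only 2 run per token.
d = 16
experts = [lambda x, W=np.random.randn(d, d): x @ W for _ in range(8)]
router_w = np.random.randn(d, 8)
token = np.random.randn(d)
out = moe_layer(token, experts, router_w)
```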
The Chinchilla paper25 revisited the compute-optimal scaling laws and used three different approaches to find that near-equal scaling of parameters and data is optimal as compute increases. Thus, a 100-fold increase in compute should translate into a tenfold increase in both data size and model size.
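As a worked example of this prescription, the sketch below applies equal square-root scaling of parameters and tokens starting from Chinchilla's published configuration (70B parameters, 1.4T tokens); the helper function is purely illustrative:

```python
def optimal_scaling(compute_multiplier, n0, d0):
    # Near-equal scaling: both N (parameters) and D (tokens) grow as sqrt(compute),
    # so a 100x compute budget yields roughly 10x more parameters and 10x more data.
    scale = compute_multiplier ** 0.5
    return n0 * scale, d0 * scale

# Starting from Chinchilla's configuration (70B parameters, 1.4T tokens):
n, d = optimal_scaling(100, n0=70e9, d0=1.4e12)
print(f"N = {n/1e9:.0f}B parameters, D = {d/1e12:.0f}T tokens")  # 700B, 14T
```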
To verify the updated scaling law, DeepMind trained a 70B parameter model (called Chinchilla) using the same compute budget as the previously trained Gopher model. Chinchilla uniformly and significantly outperformed Gopher (280B),21 GPT-3 (175B),13 and Megatron-Turing NLG (530B)26 on a large range of downstream evaluation tasks. Due to being 4x smaller than Gopher, both the memory footprint and the inference cost of Chinchilla are also smaller.
Pathways language model (PaLM)28 is a 540-billion parameter transformer-based large language model developed by Google AI. It was trained on a massive dataset of text and code and is capable of performing a wide range of tasks, including common sense reasoning, arithmetic reasoning, joke explanation, code generation, and translation.
PaLM 230 is a successor to PaLM that was announced in May 2023. Thanks to a number of architectural and training enhancements, PaLM 2 is even more capable than PaLM, with fewer total parameters. It excels at advanced reasoning tasks, including code generation, math, classification, question answering, and translation.
Gemini31 (Figure 6) is a state-of-the-art family of multimodal language models that can take interleaved sequences of text, image, audio, and video as input. It’s built on top of transformer decoders and has architectural improvements for scale as well as optimized inference on Google’s Tensor Processing Units (TPUs). In its current 1.5 version, these models are trained to support contexts of different sizes, up to 2M tokens in the Gemini 1.5 Pro version on Vertex AI, and employ mechanisms such as multi-query attention for efficiency. Gemini models also employ a Mixture of Experts architecture to optimize the efficiency and capabilities of the models. Multimodality allows the models to process text, images, and video as input, with more modalities in input and output expected in the future.
During the initial part of 2024, Google introduced the latest model of the Gemini family, Gemini 1.5 Pro,32 a highly compute-efficient multimodal mixture-of-experts model. This model also dramatically increased the size of the context window to millions of tokens and is capable of recalling and reasoning over those millions of tokens, including multiple long documents and hours of video and audio.
Gemini Flash is a new addition to the Gemini model family and the fastest Gemini model served in the API. It’s optimized for high-volume, high-frequency tasks at scale, is more cost-efficient to serve and features a breakthrough long context window of 1 million tokens. Although it is a lighter-weight model than 1.5 Pro, it is highly capable of multimodal reasoning across vast amounts of information and delivers impressive quality for its size.
Gemma 2,33 developed by Google AI, represents a significant advancement in the field of open large language models. Designed with a focus on efficiency, the 27-billion parameter model boasts performance comparable to much larger models like Llama 3 70B33 on standard benchmarks.
Released by Meta AI, Llama 3.2 is the next generation of their open LLMs. Llama 3.2 includes multilingual text-only models (1B, 3B) and vision LLMs (11B, 90B), with quantized versions of the 1B and 3B models offering on average up to 56% smaller size and 2-3x speedup, ideal for on-device and edge deployments. Llama 3.2 utilizes grouped-query attention and a 128K-token vocabulary for enhanced performance and efficiency.
Developed by Mistral AI, Mixtral 8x7B is a Sparse Mixture of Experts (SMoE) model. While its total parameter count is 47B, it utilizes only 13B active parameters per token during inference, leading to faster inference and higher throughput. This model excels in mathematics, code generation, and multilingual tasks, often outperforming LLaMA 2 70B in these domains.
Large language models typically undergo multiple training stages. The first stage, often referred to as pre-training, is the foundational stage where an LLM is trained on large, diverse, and unlabelled text datasets where it’s tasked to predict the next token given the previous context. The goal of this stage is to leverage a large, general distribution of data and to create a model that is good at sampling from this general distribution.
Pre-training is the most expensive stage in terms of time (from weeks to months, depending on the size of the model) and the amount of required computational resources (GPU/TPU hours).
After training, the model can be further specialized via fine-tuning, typically called instruction-tuning or simply supervised fine-tuning (SFT).
As mentioned in the previous section, SFT is the process of improving an LLM’s performance on a specific task or set of tasks by further training it on domain-specific, labeled data. The dataset is typically significantly smaller than the pre-training datasets, and is usually human-curated and of high quality.
Typically, after performing SFT, a second stage of fine-tuning occurs which is called reinforcement learning from human feedback (RLHF). This is a very powerful fine-tuning technique that enables an LLM to better align with human-preferred responses (i.e. making its responses more helpful, truthful, safer, etc.).
In contrast to SFT, where an LLM is only exposed to positive examples (e.g. high-quality demonstration data), RLHF makes it possible to also leverage negative outputs thus penalizing an LLM when it generates responses that exhibit undesired properties. Penalizing negative output makes it less likely to generate unhelpful or unsafe responses.
To leverage RLHF, a reward model (RM) typically needs to be trained. An RM is usually initialized from a pretrained transformer model, often one that has already undergone SFT. It is then tuned on human preference data, which is either single-sided (a prompt, a response, and a score) or composed of a prompt and a pair of responses along with a preference label indicating which of the two responses was preferred.
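A minimal sketch of the pairwise objective commonly used for such preference data (a Bradley-Terry-style loss on the score difference); the function and values are illustrative, not the whitepaper's exact formulation:

```python
import math

def pairwise_rm_loss(score_preferred, score_rejected):
    """Push the reward model's score for the preferred response above the
    rejected one: loss = -log(sigmoid(score_preferred - score_rejected))."""
    return -math.log(1.0 / (1.0 + math.exp(-(score_preferred - score_rejected))))

# If the RM already ranks the preferred response higher, the loss is small.
print(pairwise_rm_loss(2.0, -1.0))  # ~0.05
print(pairwise_rm_loss(-1.0, 2.0))  # ~3.05
```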
Both SFT and RLHF are still very costly in terms of compute time and accelerators required, especially when fully fine-tuning entire LLMs with billions of parameters.
Fortunately, there are useful and effective techniques that can make fine-tuning significantly cheaper and faster compared to pre-training and full fine-tuning. One such family of methods is parameter-efficient fine-tuning (PEFT) techniques.
At a high-level, PEFT approaches append a significantly smaller set of weights (e.g., on the order of thousands of parameters) that are used to ‘perturb’ the pre-trained LLM weights. The perturbation has the effect of fine-tuning the LLM to perform a new task or set of tasks. This has the benefit of training a significantly smaller set of weights, compared to traditional fine-tuning of the entire model.
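LoRA is one widely used PEFT method that fits this description: it freezes the pre-trained weights and trains only a small low-rank perturbation. A minimal sketch, with illustrative dimensions:

```python
import numpy as np

d, r = 1024, 8   # model dimension and (much smaller) adapter rank

W = np.random.randn(d, d)           # frozen pre-trained weight matrix
A = np.random.randn(r, d) * 0.01    # trainable low-rank factor
B = np.zeros((d, r))                # initialized to zero so W is unchanged at start

def adapted_forward(x):
    # Only A and B (2*d*r parameters) are trained; W (d*d parameters) stays frozen.
    return x @ W.T + x @ (B @ A).T

x = np.random.randn(d)
y = adapted_forward(x)
print(W.size, A.size + B.size)      # 1,048,576 frozen vs 16,384 trainable
```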
- Few-shot prompting: This is when you provide the LLM with a task description as well as a few (e.g., three to five) carefully chosen examples that help guide the LLM’s response. For example, you might provide the model with the names of a few countries and their capital cities, then ask it to generate the capital for a new country that isn’t in the examples.
- Zero-shot prompting: This is when you provide the LLM directly with a prompt containing instructions. You usually give the LLM a task description, and the LLM relies heavily on its existing knowledge to output the correct response. This requires no additional data or examples, hence the name ‘zero-shot’, but it can be less reliable than few-shot prompting.
- Chain-of-thought prompting: This technique aims to improve performance on complex reasoning tasks. Rather than simply asking the LLM a question, you provide a prompt that demonstrates how to solve similar problems using step-by-step reasoning. The LLM then generates its own chain of thought for the new problem, breaking it down into smaller steps and explaining its reasoning. Finally, it provides an answer based on its reasoning process.
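The three prompting styles above might look like the following (the prompt text is made up purely for illustration):

```python
zero_shot = "What is the capital of France?"

few_shot = """Return the capital city of the given country.
Country: Japan -> Capital: Tokyo
Country: Kenya -> Capital: Nairobi
Country: Canada -> Capital: Ottawa
Country: France -> Capital:"""

chain_of_thought = """Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each.
How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11. The answer is 11.
Q: The cafeteria had 23 apples. It used 20 and bought 6 more. How many apples are there?
A:"""
```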
A variety of sampling techniques can be employed to determine how the model chooses the next token in a sequence. They are essential for controlling the quality, creativity, and diversity of the LLM’s output.
- Greedy search: Selects the token with the highest probability at each step. This is the simplest option but it can lead to repetitive and predictable outputs.
- Random sampling: Selects the next token according to the probability distribution, where each token is sampled proportionally to its predicted probability. This can produce more surprising and creative text, but also a higher chance of nonsensical output.
- Temperature sampling: Adjusts the probability distribution by a temperature parameter. Higher temperatures promote diversity, lower temperatures favor high-probability tokens.
- Top-K sampling: Randomly samples from the top K most probable tokens. The value of K controls the degree of randomness.
- Top-P sampling (nucleus sampling): Samples from a dynamic subset of tokens whose cumulative probability adds up to P. This allows the model to adapt the number of potential candidates depending on its confidence, favoring more diversity when uncertain and focusing on a smaller set of highly probable words when confident.
- Best-of-N sampling: Generates N separate responses and selects the one deemed best according to a predetermined metric (e.g., a reward model or a logical consistency check). This is particularly useful for short snippets or situations where logic and reasoning are key.
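A minimal sketch of several of these strategies operating on a single vector of next-token logits (greedy, temperature, top-K, and top-P); this is illustrative code, not a production sampler:

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def greedy(logits):
    return int(np.argmax(logits))

def temperature_sample(logits, temperature=1.0):
    # T < 1 sharpens the distribution (more predictable), T > 1 flattens it (more diverse).
    return int(np.random.choice(len(logits), p=softmax(logits / temperature)))

def top_k_sample(logits, k=5):
    top = np.argsort(logits)[-k:]                  # keep only the k most probable tokens
    return int(np.random.choice(top, p=softmax(logits[top])))

def top_p_sample(logits, p=0.9):
    probs = softmax(logits)
    order = np.argsort(probs)[::-1]                # tokens sorted by probability
    keep = order[np.cumsum(probs[order]) <= p]     # prefix whose cumulative mass is <= p
    keep = order[:max(len(keep), 1)]               # always keep at least one token
    return int(np.random.choice(keep, p=softmax(logits[keep])))
```

In practice these are usually combined, for example applying temperature before top-K or top-P filtering.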
By combining prompt engineering with sampling techniques and correctly calibrated hyperparameters, you can greatly influence the LLM’s response, making it more relevant, creative, and consistent for your specific needs.
Language models have been consistently increasing in size and this has been a direct contributor to the vast improvement in these models’ quality and accuracy over the last few years.
As increasing the number of parameters has improved the quality of LLMs it has also increased the computational resources needed to run them.
One important distinction when approaching this trade-off is between the theoretical possibility of a quality loss versus the practical capability of the model to perform the desired task. This is use case specific and exploring it will often lead to significant speedups without sacrificing quality in a meaningful or noticeable way. For example, if the task we want the model to perform is simple, then a smaller model or a quantized one will likely be able to perform this task well. Reduction in parametric capacity or precision does not automatically mean that the model is less capable at that specific task.
LLMs are fundamentally composed of multiple numerical matrices (a.k.a. the model weights). During inference, matrix operations are applied to these model weights to produce numerical outputs (a.k.a. activations).
Quantization is the process of decreasing the numerical precision in which weights and activations are stored, transferred, and operated upon. The default representation of weights and activations is usually 32-bit floating-point numbers; with quantization we can drop the precision to 8-bit or even 4-bit integers.
Quantization’s impact on quality can be very mild to non-existent depending on the use case and model. Further, in cases where quantization might introduce a quality regression, that regression can be small compared to the performance gain, allowing for an effective quality vs. latency/cost trade-off.
Quantization can either be applied as an inference-only operation, or it can be incorporated into the training (referred to as quantization-aware training, QAT). QAT is generally considered the more resilient approach, as the model is able to recover some of the quantization-related quality losses during training.
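A minimal sketch of inference-only (post-training) quantization using a single symmetric int8 scale per tensor; production schemes typically use per-channel scales and, as noted above, may instead use QAT:

```python
import numpy as np

def quantize_int8(w):
    # Symmetric quantization: map float32 weights into [-127, 127] integers
    # with a single scale factor for the whole tensor.
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(256, 256).astype(np.float32)
q, scale = quantize_int8(w)
print("bytes:", w.nbytes, "->", q.nbytes)                      # 4x smaller storage
print("max error:", np.abs(w - dequantize(q, scale)).max())    # small per-weight error
```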
Distillation is a set of training techniques that targets improving the quality of a smaller model (the student) using a larger model (the teacher). This method can be effective because larger models outperform smaller ones even if both are trained on the same data, mainly due to parametric capacity and training dynamics.
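One common distillation recipe trains the student to match the teacher's temperature-softened output distribution via a KL-divergence term; a sketch under that assumption (the whitepaper does not prescribe this exact loss):

```python
import numpy as np

def softmax(logits, t=1.0):
    e = np.exp((logits - logits.max()) / t)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, t=2.0):
    """KL(teacher || student) on temperature-softened distributions: the student
    learns from the teacher's full output distribution, not just the top label."""
    p_teacher = softmax(teacher_logits, t)
    p_student = softmax(student_logits, t)
    return float(np.sum(p_teacher * np.log(p_teacher / p_student))) * t * t

teacher = np.array([4.0, 1.0, 0.5, -2.0])   # confident, well-trained teacher
student = np.array([2.0, 1.5, 0.5, -1.0])   # smaller student, less sharp
print(distillation_loss(student, teacher))
```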
Scaled dot-product attention, the predominant attention mechanism in the transformer architecture, is quadratic in the input length. Optimizing the self-attention calculation can bring significant latency and cost wins.
Flash Attention, introduced by Tri Dao et al., optimizes the attention calculation by making the attention algorithm IO-aware, in particular minimizing the amount of data moved between slow HBM (high-bandwidth memory) and the faster memory tier (SRAM/VMEM) in TPUs and GPUs.
One of the most compute-intensive, and thus slowest, operations in LLM inference is calculating the attention key and value scores (a.k.a. KV) for the input passed to the LLM; this operation is often referred to as prefill. The final output of prefill is what is termed the KV cache, which includes the attention key and value scores for each layer of the transformer for the entire input. This cache is vital during the decoding phase, which produces the output tokens: the KV cache allows us to avoid recalculating attention scores for the input on each autoregressive decode step.
Prefix caching refers to caching the KV cache itself between subsequent inference requests in order to reduce the latency and cost of the prefill operation. The way the self-attention mechanism works makes reusing KV caches possible, because tokens only attend to tokens that came before them in the sequence. If new input is appended to input the model has seen before, we can potentially avoid recalculating the prefill for the older input.
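A schematic decode loop showing where the KV cache fits; `model.prefill` and `model.decode_step` are hypothetical stand-ins, not a real API:

```python
def generate(model, prompt_ids, max_new_tokens):
    # Prefill: one pass over the whole prompt produces the initial KV cache.
    logits, kv_cache = model.prefill(prompt_ids)       # hypothetical stand-in call
    output = []
    next_id = int(logits[-1].argmax())
    for _ in range(max_new_tokens):
        output.append(next_id)
        # Decode step: only the newest token is processed; its keys/values are
        # appended to the cache instead of recomputing attention for the prefix.
        logits, kv_cache = model.decode_step(next_id, kv_cache)  # stand-in call
        next_id = int(logits.argmax())
    return output
```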
Speculative decoding (Leviathan et al.) aims to overcome this limitation in decoding by finding a way to utilize spare compute capacity to make each decode step faster. The main idea is to use a much smaller secondary model (often referred to as the drafter) to run ahead of the main model and predict more tokens (e.g., 4 tokens ahead). This happens very quickly because the drafter is much faster and smaller than the main model. We then use the main model to verify the hypotheses of the drafter in parallel for each of the 4 steps (i.e., the first token, the first two tokens, the first three tokens, and finally all 4 tokens), and we then select the accepted hypothesis with the maximum number of tokens.
One important condition for speculative decoding to work effectively is that the drafter model has good levels of alignment with the main model, otherwise we won’t be able to accept any of the tokens. So investing in the training quality of the drafter model is worthwhile to get better latencies.
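A simplified sketch of the draft-then-verify loop using greedy agreement; `drafter.generate` and `target_model.predict_all` are hypothetical stand-ins, and the actual algorithm uses a probabilistic acceptance rule that preserves the target model's output distribution:

```python
def speculative_decode_step(target_model, drafter, context, k=4):
    """Drafter proposes k tokens cheaply; the target model scores all k positions
    in one parallel pass and we keep the longest agreeing prefix (greedy variant)."""
    draft = drafter.generate(context, num_tokens=k)          # fast, cheap guesses
    target_preds = target_model.predict_all(context, draft)  # k+1 predictions, one pass
    accepted = []
    for i, token in enumerate(draft):
        if token == target_preds[i]:
            accepted.append(token)            # target agrees: keep the "free" token
        else:
            accepted.append(target_preds[i])  # disagreement: take target's token, stop
            break
    else:
        accepted.append(target_preds[k])      # all k accepted: bonus token from target
    return context + accepted
```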
The transformer architecture is the basis for all modern-day LLMs. Across the various architectures mentioned in this whitepaper, we see that it’s important not only to add more parameters to the model; the composition of the training dataset is equally important.
The order and strategies used for fine-tuning are important and may include multiple steps such as instruction tuning, safety tuning, etc. Supervised fine-tuning (SFT) is important for capturing the essence of a task. RLHF, and potentially RLAIF, can be used to shift the distribution from the pretraining distribution to a more desired one through the power of the reward function, which can reward desirable behaviors and penalize undesirable ones.
Making inference from neural models efficient is an important problem and an active field of research. Many methods exist to reduce serving costs and latency with minimal impact to model performance, and some exact acceleration methods guarantee identical model outputs.
Large language models can be used for a variety of tasks, including summarization, translation, question answering, chat, code generation, and many more. You can create your own tasks using the Vertex and MakerSuite text generation services, which leverage Google’s latest language models. After the model has been trained and tuned, it is important to experiment with prompt engineering. You should use the technique most appropriate for the task at hand because LLMs can be sensitive to prompts. Furthermore, it is also possible to enhance task-specific performance, creativity, and diversity by tweaking the parameters corresponding to sampling techniques such as Top-K, Top-P, and max decoding steps to find the optimum mix of correctness, diversity, and creativity required for the task at hand.
%% end annotations %%
%% Import Date: 2024-11-13T09:07:05.422-07:00 %%