The attention mechanism is a technique for weighting inputs according to their importance for the task at hand. Attention mechanisms are often characterized as hard or soft. Soft attention, the most common form, applies a soft weighting (mask) over all inputs; it is [[differentiable]], which allows training with backpropagation. Hard attention selects a discrete subset of inputs; it is not differentiable, so training relies on gradient estimation. Another way to characterize attention mechanisms is by how much of the input they consider: global attention uses all inputs, while local attention considers only a neighborhood around the current position. Attention can also incorporate external information, for example from memory or a knowledge base.

The attention mechanism, specifically the query-key-value (QKV) paradigm, is used in models like the [[transformer]] to capture dependencies between words or tokens in a sequence. For example, when we read the word "eat", we may wonder "what is eaten?" and look for other words that are "edible". Here's how that relates to the QKV paradigm:

- **Query**: what the current token is looking for. When reading "eat", the question formed is "what is eaten?".
- **Key**: what each other token advertises about itself. A token whose key says "edible" matches the query "what is eaten?" well.
- **Value**: the information actually retrieved. Tokens whose keys match the query contribute their values (their content) to the updated representation of "eat", weighted by how well their keys matched.

In the [[transformer]] architecture, attention groups related tokens, and the grouped representations are passed to a feed-forward network (FFN) that generates new features from them. This attention + FFN block is repeated many times; GPT-3, for example, stacks 96 such layers.

# additive attention (Bahdanau)
Scores each query-key pair with a small feed-forward network; introduced for neural machine translation.

# Scaled dot product attention (Vaswani)
Scores each query-key pair with a dot product scaled by the square root of the key dimension, then applies a softmax to obtain the attention weights (see the sketch at the end of this note).

# multi-head attention
- allows the model to focus on different parts of the input simultaneously
- each head uses scaled dot product attention; the heads' outputs are concatenated and projected (see the sketch at the end of this note)

## masked self-attention
Decoder-only models employ masked self-attention during training to ensure that the prediction of a token only depends on the tokens that have been generated up to that point (as it would during inference).

# attention scoring functions
Additive attention and scaled dot product attention are the two most common scoring functions. Separately from the scoring function, attention is also described by where the queries, keys, and values come from:
- self-attention: Q, K, and V all come from the same sequence
- [[cross-attention]]: Q comes from one sequence, K and V from another (e.g., the decoder attending to the encoder output)

## key papers
- Bahdanau 2015: attention for neural machine translation
- Vaswani 2017: transformer ("Attention Is All You Need")
- Radford 2018, 2019; Brown 2020: GPT-1, GPT-2, GPT-3
- Devlin 2018: BERT

> [!Tip]- Additional Resources
> - [Attention for Neural Networks, Clearly Explained!!!](https://youtu.be/PSs6nxngL6k?si=dD1S_xF9fZpTKfZ2) | StatQuest with Josh Starmer
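# code sketches
A minimal NumPy sketch of scaled dot product attention, with an optional causal mask to illustrate the masked self-attention used by decoder-only models. The function and variable names are illustrative, not taken from any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, causal=False):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v). Returns (seq_len, d_v)."""
    d_k = Q.shape[-1]
    # Score every query against every key, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)                      # (seq_len, seq_len)
    if causal:
        # Masked self-attention: position i may only attend to positions <= i.
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)                   # each query's weights sum to 1
    return weights @ V                                   # weighted sum of the values

# Toy self-attention: 4 tokens, 8-dimensional features, queries = keys = values = X.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(X, X, X, causal=True)
print(out.shape)  # (4, 8)
```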
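Continuing the sketch above (reusing `scaled_dot_product_attention` and `X`), a minimal multi-head attention: the input is projected, split into heads, each head runs scaled dot product attention independently, and the head outputs are concatenated and projected back. The random matrices stand in for learned weights and are assumptions for illustration only.

```python
def multi_head_attention(X, num_heads=2, causal=False, seed=1):
    """X: (seq_len, d_model). Returns (seq_len, d_model)."""
    d_model = X.shape[-1]
    assert d_model % num_heads == 0
    d_head = d_model // num_heads
    rng = np.random.default_rng(seed)
    # Random stand-ins for the learned projections W_Q, W_K, W_V, W_O.
    W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = []
    for h in range(num_heads):
        s = slice(h * d_head, (h + 1) * d_head)
        # Each head attends over its own slice of the projected features.
        heads.append(scaled_dot_product_attention(Q[:, s], K[:, s], V[:, s], causal=causal))
    # Concatenate the heads and mix them with the output projection.
    return np.concatenate(heads, axis=-1) @ W_o

print(multi_head_attention(X, num_heads=2).shape)  # (4, 8)
```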