Cross-attention is the mechanism that links the [[encoder]] and [[decoder]] layers in the [[transformer]]-based [[encoder-decoder model|encoder-decoder architecture]]. Cross-attention employs a Query (Q), Key (K), and Value (V) mechanism: the queries come from the decoder's previous output, while the keys and values come from the encoder's output. This allows the decoder to pay "attention" to specific parts of the encoder's output based on what it has seen so far. The attention mechanism computes how much focus should be given to each token in the encoder's output for each token in the decoder's current input using [[scaled dot product attention]]. The result is a weighted sum of the encoder's outputs, which is passed onward through the decoder.

Concretely, the encoder produces embeddings $E = \{ e_1, e_2, \dots, e_n\}$. The decoder generates queries $Q$ from its previous output, while the keys $K$ and values $V$ are taken from the encoder output $E$. The attention score is computed with [[softmax]] as $\text{Attention}(Q, K, V) = \text{softmax}\Big(\frac{QK^T}{\sqrt{d_k}}\Big)V$, where $d_k$ is the dimension of the key vectors. This produces a decoder output sequence that takes into account both the previous decoder tokens and the encoder's context.

After processing the cross-attention and self-attention in the decoder, the final output is passed through a linear layer (and typically a softmax layer) to generate predictions for the next token in the sequence. This output is used as the next input for the autoregressive process in sequence generation tasks such as translation or text generation.
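
To make the data flow concrete, here is a minimal sketch of single-head cross-attention in Python with PyTorch. The tensor names (`encoder_out`, `decoder_hidden`), the projection matrices, and the dimensions are illustrative assumptions, not part of any particular library's API; real transformer implementations use multi-head attention with learned projection layers.

```python
import torch
import torch.nn.functional as F

def cross_attention(decoder_hidden, encoder_out, w_q, w_k, w_v):
    """Single-head cross-attention sketch: queries come from the decoder,
    keys and values come from the encoder output."""
    Q = decoder_hidden @ w_q   # (tgt_len, d_k) queries from decoder states
    K = encoder_out @ w_k      # (src_len, d_k) keys from encoder output
    V = encoder_out @ w_v      # (src_len, d_v) values from encoder output

    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # (tgt_len, src_len) scaled dot products
    weights = F.softmax(scores, dim=-1)            # attention over encoder tokens
    return weights @ V                             # (tgt_len, d_v) weighted sum of encoder values

# Illustrative shapes: 6 source tokens, 4 target tokens, model dimension 8
d_model = 8
encoder_out = torch.randn(6, d_model)     # E = {e_1, ..., e_n} from the encoder
decoder_hidden = torch.randn(4, d_model)  # previous decoder-layer output
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))

context = cross_attention(decoder_hidden, encoder_out, w_q, w_k, w_v)
print(context.shape)  # torch.Size([4, 8]): one context vector per decoder token
```

In a full decoder block, this context would then pass through the feed-forward sublayer and, at the top of the stack, the final linear and softmax layers that predict the next token.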