Transformers were introduced by Vaswani et al. (2017) in the now-famous paper *Attention Is All You Need*. The transformer improves on convolutional and recurrent architectures by allowing computation over a sequence to be parallelized within the [[neural network]]. The original application of the transformer architecture was [[machine translation]]; owing to its better performance and parallelizability, it has since been adapted to many other tasks. Transformers employ [[attention]] to model dependencies between positions in a sequence, regardless of their distance from one another.
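
As a concrete illustration, below is a minimal sketch of the scaled dot-product attention described in the paper, written in NumPy. The function name and shapes are illustrative, not from the source; it omits masking, multiple heads, and batching for clarity.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention (Vaswani et al., 2017).

    Q, K: (seq_len, d_k); V: (seq_len, d_v).
    Returns one output row per query position.
    """
    d_k = Q.shape[-1]
    # Score each query against every key, scaled by sqrt(d_k)
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over keys turns scores into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted average of the value vectors
    return weights @ V

# Self-attention: queries, keys, and values all come from the same sequence
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))  # 4 tokens, model dimension 8
out = scaled_dot_product_attention(x, x, x)
```

Because the scores for all positions are computed as one matrix product rather than step by step, the whole sequence can be processed in parallel, which is the key advantage over recurrence noted above.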