Bidirectional Encoder Representations from Transformers (BERT) is an [[encoder]]-only [[language model]] based on the [[transformer]] architecture. BERT generates [[word embeddings|embeddings]] for each subword token using its context. For example, the word "bank" in "river bank" and "financial bank" will have different representations.
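The contextual nature of these embeddings can be seen directly. The following is a minimal sketch using the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint (both assumptions of this example, not specified above); the `bank_embedding` helper is hypothetical, written only to illustrate the comparison.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_embedding(sentence: str) -> torch.Tensor:
    """Return BERT's hidden state for the 'bank' token in the given sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Find the position of the "bank" token in the input sequence.
    bank_id = tokenizer.convert_tokens_to_ids("bank")
    position = (inputs["input_ids"][0] == bank_id).nonzero()[0].item()
    return outputs.last_hidden_state[0, position]

river = bank_embedding("She sat on the river bank.")
money = bank_embedding("He deposited cash at the bank.")
# The two vectors differ because each reflects its surrounding context.
print(torch.cosine_similarity(river, money, dim=0))
```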
BERT was named after the Sesame Street character, following on from the earlier [[ELMo]] model.
BERT was pretrained on a large unlabelled corpus (the BooksCorpus and English Wikipedia) using [[masked language modeling]]: a fraction of the input tokens is hidden and the model learns to predict them from the surrounding left and right context.
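A small sketch of this objective in action, again assuming the Hugging Face `transformers` library and the `bert-base-uncased` checkpoint:

```python
from transformers import pipeline

# The fill-mask pipeline wraps BERT's masked-language-modeling head.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# BERT predicts the hidden token from its left and right context.
for prediction in unmasker("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))
```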
BERT can be [[fine tuning|fine tuned]] on a specific downstream task. For example, a text classification task replaces BERT's pretraining output layer with a classification head and fine tunes the whole model, as sketched below. Because the full model is large, smaller variants such as [[DistilBERT]] and [[ALBERT]] have been developed. Limitations of BERT include its 512-token context window. It is also not a generative model, so it is unsuitable for text generation; the related encoder-decoder model [[BART]] covers that use case.
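A hedged sketch of the head-replacement step with the Hugging Face `transformers` library; the two-label sentiment setup is purely illustrative.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Loads the pretrained encoder and attaches a fresh (randomly initialised)
# classification head in place of the pretraining output layer.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

inputs = tokenizer("This film was wonderful.", return_tensors="pt")
logits = model(**inputs).logits  # one score per label, learned during fine tuning
```

In practice the head and the encoder weights are then trained together on the labelled downstream data.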
BERT was introduced by Google researchers in 2018.