Word embeddings are vector-based meaning representations relying on the [[distributional hypothesis]].
In **sparse** embeddings, words (or [[tokenization|tokens]]) are represented as a function of the counts of the words they co-occur with, often at the document level. Vector length is determined by the size of the collection: the number of documents in a term-document matrix, or the vocabulary size in a term-term co-occurrence matrix.
A **term-document matrix** records the count of each term in each document. A term's vector is then its row of counts across the documents in the corpus.
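A minimal sketch of building a term-document matrix with scikit-learn's `CountVectorizer`; the toy documents are made up purely for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus: each string is one "document" (made up for illustration).
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)        # sparse document-term matrix (docs x vocab)

# Transpose to the term-document view: one row of counts per term.
term_doc = X.T.toarray()
for term, row in zip(vectorizer.get_feature_names_out(), term_doc):
    print(f"{term:>6}: {row}")            # e.g. 'sat': [1 1 0]
```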
In **dense** representations, the dimensionality is reduced. Representations are trained from larger corpora using self-supervised learning based on co-occurrence statistics. Vector length is typically 50 to 1,000 dimensions, with 300 a common choice.
Word embeddings support (see the sketch after this list):
- [[similarity measure]]
- composition of meaning
- relational and analogical reasoning
- visualization and clustering methods
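A minimal numpy sketch of the similarity and analogy operations above; the 3-d vectors and their values are made up purely for illustration (real embeddings have hundreds of dimensions, and real analogy search excludes the query words from the candidates).

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot product of the normalized vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Tiny made-up 3-d vectors, just to show the operations.
vec = {
    "king":  np.array([0.8, 0.6, 0.1]),
    "queen": np.array([0.7, 0.6, 0.6]),
    "man":   np.array([0.9, 0.1, 0.0]),
    "woman": np.array([0.8, 0.1, 0.5]),
}

# Similarity measure
print(cosine(vec["king"], vec["queen"]))

# Analogical reasoning via vector offsets: king - man + woman ~ queen
target = vec["king"] - vec["man"] + vec["woman"]
best = max(vec, key=lambda w: cosine(vec[w], target))
print(best)    # 'queen' with these toy values
```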
## latent semantic analysis
Latent semantic analysis (LSA) applies truncated [[singular value decomposition|SVD]] to a term-document matrix, keeping only the top singular dimensions as a low-rank "latent" representation.
It was originally developed for query/document similarity in information retrieval.
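A sketch of LSA-style dimensionality reduction using scikit-learn's `TruncatedSVD` on a TF-IDF document-term matrix; the corpus and the number of components are illustrative choices.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "stock markets fell sharply today",
    "investors sold shares as markets dropped",
]

X = TfidfVectorizer().fit_transform(docs)    # documents x terms

# Truncated SVD keeps only the top k singular vectors ("latent" dimensions).
svd = TruncatedSVD(n_components=2, random_state=0)
doc_topics = svd.fit_transform(X)            # documents x k

print(doc_topics)    # low-dimensional document representations
```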
## word2vec
word2vec learns dense embeddings by training a shallow classifier on a self-supervised prediction task (skip-gram or CBOW), typically with negative sampling; the learned weights are the word vectors.
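A minimal training sketch with gensim's `Word2Vec`; the corpus is far too small for real training, and the parameters and gensim 4.x argument names are assumptions for illustration.

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
    ["dogs", "and", "cats", "are", "pets"],
]

# sg=1 selects the skip-gram objective; sg=0 would use CBOW.
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(model.wv["cat"][:5])                  # first 5 dimensions of the vector
print(model.wv.most_similar("cat", topn=3))
```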
## GloVe
GloVe, short for Global Vectors, is a method for [[word embeddings]] based on ratios of word co-occurrence probabilities.
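Pretrained GloVe vectors can be queried through gensim's downloader; the model name `glove-wiki-gigaword-100` is assumed to be one of the bundled datasets (it downloads the vectors on first use).

```python
import gensim.downloader as api

# Load 100-d GloVe vectors trained on Wikipedia + Gigaword (downloaded on first use).
glove = api.load("glove-wiki-gigaword-100")   # returns a KeyedVectors object

print(glove.most_similar("frog", topn=5))
print(glove.similarity("ice", "steam"))       # GloVe's motivating example pair
```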
## FastText
FastText is a word embedding method that extends word2vec with subword information: each word vector is built from the vectors of its character n-grams, so embeddings can be produced even for unseen (out-of-vocabulary) words.
Implementations of FastText are available at [FastText](https://fasttext.cc/) and in the [Gensim](https://radimrehurek.com/gensim/auto_examples/tutorials/run_fasttext.html#sphx-glr-auto-examples-tutorials-run-fasttext-py) library.
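A sketch using gensim's `FastText` class to illustrate the subword idea: because vectors are composed from character n-grams, a word missing from the toy training data still gets a vector. The corpus and parameters are illustrative, and gensim 4.x argument names are assumed.

```python
from gensim.models import FastText

sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
]

# min_n/max_n control the character n-gram sizes used to build word vectors.
model = FastText(sentences, vector_size=50, window=2, min_count=1,
                 min_n=3, max_n=5, epochs=20)

print("catlike" in model.wv.key_to_index)   # False: never seen in training
print(model.wv["catlike"][:5])              # still gets a vector from its n-grams
```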
See also, for contextual embeddings that assign each token a vector depending on its sentence context:
- [[ELMo]]
- [[BERT]]