## Embeddings & Vector Stores

> [!Abstract]-
> Modern machine learning thrives on diverse data—images, text, audio, and more. This whitepaper explores the power of embeddings, which transform this heterogeneous data into a unified vector representation for seamless use in various applications. We'll guide you through:
>
> - Understanding Embeddings: Why they are essential for handling multimodal data and their diverse applications.
> - Embedding Techniques: Methods for mapping different data types into a common vector space.
> - Efficient Management: Techniques for storing, retrieving, and searching vast collections of embeddings.
> - Vector Databases: Specialized systems for managing and querying embeddings, including practical considerations for production deployment.
> - Real-World Applications: Concrete examples of how embeddings and vector databases are combined with large language models (LLMs) to solve real-world problems.
>
> Throughout the whitepaper, code snippets provide hands-on illustrations of key concepts.

> [!Cite]-
> Nawalgaria, Anant, and Xiaoqi Ren. “Embeddings & Vector Stores,” September 1, 2024. [https://www.kaggle.com/whitepaper-embeddings-and-vector-stores](https://www.kaggle.com/whitepaper-embeddings-and-vector-stores).
>
> [link](https://www.kaggle.com/whitepaper-embeddings-and-vector-stores) [online](http://zotero.org/users/local/kycSZ2wR/items/9MVU9KCH) [local](zotero://select/library/items/9MVU9KCH) [pdf](file://C:\Users\erikt\Zotero\storage\86CI2CVG\Nawalgaria%20and%20Ren%20-%20Embeddings%20&%20Vector%20Stores.pdf)

## Notes

%% begin notes %%
%% end notes %%

%% begin annotations %%

### Imported: 2024-11-13 2:23 pm

In essence, embeddings are numerical representations of real-world data such as text, speech, images, or videos. They are expressed as low-dimensional vectors, where the geometric distance between two vectors in the vector space is a projection of the relationship between the two real-world objects that the vectors represent. These low-dimensional numerical representations significantly help efficient large-scale data processing and storage by acting as a means of lossy compression of the original data while retaining its important properties.

One of the key applications for embeddings is retrieval and recommendations, where results are usually drawn from a massive search space. For example, Google Search is retrieval with the whole internet as its search space.

An embedding is the projection of an object from an input space into a relatively low-dimensional vector space. Each vector is a list of floating-point numbers. The embeddings can then be used as condensed, meaningful input in downstream applications. For example, you can use them as features for ML models, recommender systems, search engines, and many more. So your data not only gets a compact numerical representation, but this representation also preserves the semantic meaning for a specific task or across a variety of tasks. The fact that these representations are task-specific means you can generate different embeddings for the same object, optimized for the task at hand.

In this section, you’ll see a few word embedding techniques and algorithms to both train and use word embeddings. While there are many ML-driven algorithms developed over time and optimized for different objectives, the most common ones are GloVe, SWIVEL, and Word2Vec.

Word2Vec is a family of model architectures that operates on the principle that “the semantic meaning of a word is defined by its neighbors”, or words that frequently appear close to each other in the training corpus.

GloVe is a word embedding technique that leverages both global and local statistics of words. It does this by first creating a co-occurrence matrix, which represents the relationships between words. GloVe then uses a factorization technique to learn word representations from the co-occurrence matrix. The resulting word representations capture both global and local information about words, and they are useful for a variety of NLP tasks.

In addition to GloVe, SWIVEL (Submatrix-Wise Vector Embedding Learner) is another approach that leverages the co-occurrence matrix to learn word embeddings. Unlike GloVe, it uses local windows to learn the word vectors, taking into account the co-occurrence of words within a fixed window of neighboring words. Furthermore, SWIVEL also considers unobserved co-occurrences and handles them using a special piecewise loss, boosting its performance on rare words. It is generally considered only slightly less accurate than GloVe on average, but considerably faster to train.
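A minimal sketch (not from the whitepaper) of training skip-gram Word2Vec embeddings with gensim; the toy corpus and hyperparameters are illustrative only:

```python
# Minimal sketch: training skip-gram Word2Vec embeddings with gensim.
# Assumes `pip install gensim`; the toy corpus stands in for a real tokenized corpus.
from gensim.models import Word2Vec

corpus = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["embeddings", "map", "words", "to", "dense", "vectors"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=50,   # embedding dimensionality
    window=3,         # context window of neighboring words
    min_count=1,      # keep every word in this tiny corpus
    sg=1,             # 1 = skip-gram, 0 = CBOW
)

king_vector = model.wv["king"]                     # a 50-dimensional vector
neighbors = model.wv.most_similar("king", topn=3)  # nearest words by cosine similarity
print(king_vector.shape, neighbors)
```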
Word embeddings can be directly used in some downstream tasks like Named Entity Recognition (NER). Inspired by Word2Vec, Doc2Vec was proposed in 2014 for generating document embeddings using (shallow) neural networks. Motivated by the development of deep neural networks, different embedding models and techniques have been proposed, and the state-of-the-art models are refreshed frequently. New embedding models based on large language models have been proposed; for example, GTR and Sentence-T5 show better performance on retrieval and sentence similarity (respectively) than BERT-family models. Another approach to developing new embedding models is generating multi-vector embeddings instead of a single vector to enhance the representational power of the models. Embedding models in this family include ColBERT and XTR.

Much like text, it’s also possible to create both image and multimodal embeddings. Unimodal image embeddings can be derived in many ways, one of which is by training a CNN or Vision Transformer model on a large-scale image classification task (for example, ImageNet) and then using the penultimate layer as the image embedding. This layer has learned feature maps that are discriminative for the training task and can be extended to other tasks as well. To obtain multimodal embeddings, you take the individual unimodal text and image embeddings and learn their semantic relationships via another training process, giving you a fixed-size semantic representation in the same latent space.

Unlike unstructured data, where a pre-trained embedding model is typically available, we have to create the embedding model for structured data, since it is specific to a particular application. Given a general structured data table, we can create an embedding for each row. This can be done with ML models in the dimensionality reduction category, such as PCA. One use case for these embeddings is anomaly detection: for example, we can create embeddings for anomaly detection using large data sets of labeled sensor information that identify anomalous occurrences. Another use case is to feed these embeddings to downstream ML tasks such as classification. Compared to using the original high-dimensional data, using embeddings to train a supervised model requires less data, which is particularly important in cases where training data is not sufficient.
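A minimal sketch, assuming scikit-learn, of creating per-row embeddings for a structured table with PCA and flagging anomalies by reconstruction error; the random data and cutoff are illustrative only:

```python
# Minimal sketch: PCA embeddings for structured rows plus a simple anomaly score.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))        # stand-in for a table of 1000 rows x 20 sensor columns

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=4)              # 4-dimensional row embeddings
embeddings = pca.fit_transform(X_scaled)

# Rows that reconstruct poorly from the low-dimensional space are candidate anomalies.
reconstruction = pca.inverse_transform(embeddings)
errors = np.linalg.norm(X_scaled - reconstruction, axis=1)
anomalies = np.argsort(errors)[-10:]   # the ten highest-error rows
print(embeddings.shape, anomalies)
```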
Graph embeddings are another embedding technique that lets you represent not only information about a specific object but also its neighbors (namely, its graph representation). Take the example of a social network where each person is a node and the connections between people are defined as edges. Using graph embeddings, you can model each node as an embedding, such that the embedding captures not only the semantic information about the person itself but also its relations and associations, hence enriching the embedding. Graph embeddings can also be used for a variety of tasks, including node classification, graph classification, link prediction, clustering, search, recommendation systems, and more. Popular algorithms for graph embedding include DeepWalk, Node2vec, LINE, and GraphSAGE.

Similar to foundation model training, training an embedding model from scratch usually includes two stages: pretraining (unsupervised learning) and fine-tuning (supervised learning). Nowadays, embedding models are usually initialized directly from foundation models such as BERT, T5, GPT, Gemini, and CoCa.

Full-text keyword search has been the lynchpin of modern IT systems for years. Full-text search engines and databases (relational and non-relational) often rely on explicit keyword matching. There are traditional approaches that are tolerant of misspellings and other typographical errors; however, they are still unable to find the results with the closest underlying semantic meaning to the query. This is where vector search is very powerful: it uses the vector, or embedded, semantic representation of documents. Vector search lets you go beyond searching for exact query literals and allows you to search for meaning across various data modalities.

Euclidean distance (i.e., L2 distance) is a geometric measure of the distance between two points in a vector space; it works well for lower dimensions. Cosine similarity is a measure of the angle between two vectors, and the inner (dot) product is the projection of one vector onto another; the two are equivalent when the vector norms are 1. These measures tend to work better for higher-dimensional data.

Vector databases store embeddings and help manage and operationalize the complexity of vector search at scale, while also addressing common database needs.

The most straightforward way to find the most similar match is to run a traditional linear search, comparing the query vector with each document vector and returning the one with the highest similarity. However, the runtime of this approach scales linearly (O(N)) with the number of documents or items to search.
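A minimal numpy sketch of the three similarity measures and of the brute-force O(N) scan described above; the random vectors stand in for real embeddings:

```python
# Minimal sketch: L2 distance, cosine similarity, dot product, and a linear scan.
import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(10_000, 128)).astype(np.float32)   # document embeddings
query = rng.normal(size=128).astype(np.float32)            # query embedding

# Euclidean (L2) distance: smaller means more similar.
l2 = np.linalg.norm(docs - query, axis=1)

# Cosine similarity: larger means more similar.
cosine = (docs @ query) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))

# Inner (dot) product: gives the same ranking as cosine when vectors are unit-norm.
dot = docs @ query

# Brute-force linear scan: compares the query against every document, O(N).
best = int(np.argmax(cosine))
print(best, l2[best], cosine[best], dot[best])
```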
Using approximate nearest neighbor (ANN) search for that purpose is more practical. ANN is a technique for finding the closest points to a given point in a dataset with a small margin of error, but with a tremendous boost in performance. There are many approaches with varying trade-offs across scale, indexing time, performance, simplicity, and more.

Locality sensitive hashing (LSH) is a technique for finding similar items in a large dataset. It does this by creating one or more hash functions that map similar items to the same hash bucket with high probability. This means you can quickly find all items similar to a given item by looking only at the candidates in the same hash bucket (or adjacent buckets) and doing a linear search amongst those candidates.

Tree-based algorithms work similarly. For example, the KD-tree approach creates decision boundaries by computing the median of the values of the first dimension, then of the second dimension, and so on, much like a decision tree. Naturally, this can be ineffective if the searchable vectors are high-dimensional. In that case, the Ball-tree algorithm is better suited. It is similar in functionality, except that instead of using dimension-wise medians it creates buckets based on the radial distance of the data points from a center.

One of the FAISS (Facebook AI Similarity Search) implementations leverages the concept of hierarchical navigable small world (HNSW) graphs to perform vector similarity search in sublinear ($O(\log n)$) runtime with a good degree of accuracy. An HNSW is a proximity graph with a hierarchical structure where the graph links are spread across different layers: the top layer has the longest links and the bottom layer has the shortest ones. Google developed the scalable approximate nearest neighbor (ScaNN) approach, which is used across many of its products and services.

In this whitepaper we have seen both state-of-the-art (SOTA) and traditional ANN search algorithms: ScaNN, FAISS, LSH, KD-tree, and Ball-tree, and examined the speed/accuracy trade-offs that they provide.
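A minimal sketch of HNSW-based ANN search, assuming the `faiss-cpu` package; the random vectors and parameter values are illustrative only:

```python
# Minimal sketch: approximate nearest-neighbor search with FAISS's HNSW index.
import faiss
import numpy as np

dim = 128
rng = np.random.default_rng(0)
docs = rng.normal(size=(10_000, dim)).astype("float32")   # stand-in document embeddings
query = rng.normal(size=(1, dim)).astype("float32")       # stand-in query embedding

index = faiss.IndexHNSWFlat(dim, 32)   # 32 = number of graph links per node (M)
index.hnsw.efConstruction = 200        # build-time accuracy/speed trade-off
index.add(docs)

index.hnsw.efSearch = 64               # query-time accuracy/speed trade-off
distances, ids = index.search(query, 5)  # 5 approximate nearest neighbors
print(ids[0], distances[0])
```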
However, to use these algorithms they need to be deployed in a scalable, secure, and production-ready manner. For that we need vector databases. Some databases might provide caching, pre-filtering (based on tags), and post-filtering capabilities (reranking using another, more accurate model) to further enhance query speed and performance. Good examples of commercially managed vector databases include Google Cloud's Vertex Vector Search, Google Cloud's AlloyDB and Cloud SQL for PostgreSQL, Elasticsearch, and Pinecone. Amongst their open-source peers, Weaviate and ChromaDB provide a full suite of functionality upon deployment and can also be tested in-memory during the prototyping phase.

Firstly, embeddings, unlike traditional content, can mutate over time. This means that the same text, image, video, or other content could and should be embedded using different embedding models to optimize the performance of downstream applications. However, frequently updating embeddings, especially those trained on large amounts of data, can be prohibitively expensive, so a balance needs to be struck. This necessitates a well-defined, automated process to store, manage, and possibly purge embeddings from the vector databases, taking the budget into consideration. Secondly, while embeddings are great at representing semantic information, they can sometimes be suboptimal at representing literal or syntactic information. This is especially true for domain-specific words or IDs, which may be missing or underrepresented in the data the embedding models were trained on.

Another important point to consider is that, depending on the nature of the workload in which the semantic query occurs, it might be worth relying on different vector databases. For example, for OLTP workloads that require frequent read/write operations, an operational database like Postgres or Cloud SQL is the best choice. For large-scale OLAP analytical workloads and batch use cases, using BigQuery's vector search is preferable. In conclusion, a variety of factors need to be considered when choosing a vector database, including the size and type of your dataset (some databases are good at sparse vectors, others at dense ones), business needs, the nature of the workload, budget, security and privacy guarantees, the need for semantic and syntactic search, as well as the database systems that are already in use.

Retrieval augmented generation (RAG) for Q&A is a technique that combines the best of both worlds from retrieval and generation. It first retrieves relevant documents from a knowledge base and then uses prompt expansion to generate an answer from those documents. Prompt expansion is a technique that, when combined with database search, can be very powerful: the model retrieves relevant information from the database (mostly using a combination of semantic search and business rules) and augments the original prompt with it. The model uses this augmented prompt to generate much more interesting, factual, and informative content than with retrieval or generation alone (a sketch of this flow appears at the end of this note).

Choose your embedding model wisely for your data and use case, and ensure the data used at inference is consistent with the data used in training. The distribution shift from training to inference can come from various areas, including domain distribution shift or downstream task distribution shift. If no existing embedding model fits the current inference data distribution, fine-tuning an existing model can significantly help performance. Another trade-off comes from model size: large deep neural network based models (such as large multimodal models) usually have better performance, but at the cost of longer serving latency. Using cloud-based embedding services can address this by providing both high-quality and low-latency embedding serving. For most business applications, using a pre-trained embedding model provides a good baseline, which can be further fine-tuned or integrated into downstream models. In cases where the data has an inherent graph structure, graph embeddings can provide superior performance.

Once your embedding strategy is defined, it's important to choose an appropriate vector database that suits your budget and business needs. It might seem quicker to prototype with available open-source alternatives, but opting for a more secure, scalable, and battle-tested managed vector database is likely to serve you better in the long term. There are various open-source alternatives using one of the many powerful ANN vector search algorithms, but ScaNN and HNSW have proven to provide some of the best accuracy and performance trade-offs, in that order.

Embeddings combined with an appropriate ANN-powered vector database are an incredibly powerful tool and can be leveraged for various applications, including search, recommendation systems, and retrieval augmented generation for LLMs. This approach can mitigate the hallucination problem and bolster the verifiability and trustworthiness of LLM-based systems.

%% end annotations %%

%% Import Date: 2024-11-13T14:23:33.428-07:00 %%
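To complement the RAG excerpt above: a minimal, model-agnostic sketch of prompt expansion, where `embed`, `vector_search`, and `generate` are hypothetical stand-ins for an embedding model, a vector database client, and an LLM (none of them are from the whitepaper):

```python
# Minimal sketch: retrieval augmented generation via prompt expansion.
# `embed`, `vector_search`, and `generate` are hypothetical callables supplied by the caller.
def answer_question(question: str, embed, vector_search, generate, k: int = 5) -> str:
    # 1. Embed the question and retrieve the k most similar documents.
    query_vector = embed(question)
    documents = vector_search(query_vector, top_k=k)

    # 2. Expand the original prompt with the retrieved context.
    context = "\n\n".join(doc["text"] for doc in documents)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

    # 3. Generate a grounded answer from the augmented prompt.
    return generate(prompt)
```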