Retrieval-augmented generation (RAG) has become a popular method for grounding large language models (LLMs) in external knowledge. RAG systems typically use an embedding model to encode the documents in a knowledge corpus and select the ones most relevant to the user’s query.
However, standard retrieval methods often fail to account for context-specific details that can be vital in application-specific datasets. In a new paper, researchers at Cornell University introduce “contextual document embeddings,” a technique that improves the performance of embedding models by making them aware of the context in which documents will be retrieved.
Limitations of bi-encoders
The most common approach to document retrieval in RAG is to use “bi-encoders,” where an embedding model creates a fixed representation of each document and stores it in a vector database. At inference time, the embedding of the query is computed and compared against the stored embeddings to find the most relevant documents.
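To make the mechanics concrete, here is a minimal sketch of bi-encoder retrieval using an off-the-shelf sentence-transformers model; the model name and toy corpus are illustrative and are not taken from the paper.

```python
# Minimal sketch of bi-encoder retrieval. Model name and toy corpus are
# illustrative; this is not the setup used in the paper.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # any off-the-shelf bi-encoder

documents = [
    "The quarterly report shows revenue growth of 12%.",
    "Employees must submit expense forms within 30 days.",
    "The new API supports batch requests of up to 100 items.",
]

# Indexing time: embed each document once and store the fixed vectors
# (in practice, in a vector database).
doc_embeddings = model.encode(documents, normalize_embeddings=True)

# Query time: embed the query and rank documents by cosine similarity.
query_embedding = model.encode(["how do I file expenses?"], normalize_embeddings=True)
scores = doc_embeddings @ query_embedding.T
print(documents[int(np.argmax(scores))])
```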
Bi-encoders have become a popular choice for document retrieval in RAG systems because of their speed and scalability. However, they often struggle with nuanced, application-specific datasets because they are trained on generic data. In fact, on specialized bodies of knowledge, they can fall behind classical statistical methods such as BM25 on some tasks.
“Our project began with the study of BM25, an old text mining algorithm,” John (Jack) Morris, a graduate student at Cornell Tech and co-author of the paper, told VentureBeat. “We did a little analysis and saw that the more the dataset is outside the domain, the more BM25 outperforms neural networks.”
BM25 achieves its flexibility by calculating the weight of each word in the context of the indexed corpus. For example, if a word appears in many documents of the knowledge corpus, its weight is reduced, even if it is an important keyword in other contexts. This allows BM25 to adapt to the specific characteristics of different datasets.
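This corpus-dependent weighting can be illustrated with BM25’s inverse document frequency (IDF) term; the two toy corpora below are invented for illustration only.

```python
# Toy illustration of corpus-dependent term weighting using BM25's IDF
# formula: the same word gets a low weight in a corpus where it appears
# everywhere and a higher weight where it is rare. Corpora are invented.
import math

def bm25_idf(term: str, corpus: list[str]) -> float:
    n = sum(1 for doc in corpus if term in doc.lower().split())  # docs containing the term
    N = len(corpus)
    return math.log((N - n + 0.5) / (n + 0.5) + 1)

patent_corpus = [
    "the court ruled on the patent claim",
    "patent law requires novelty",
    "the patent office rejected the filing",
]
news_corpus = [
    "markets rallied after the earnings report",
    "the central bank held rates steady",
    "a new patent dispute hit the tech sector",
]

print(bm25_idf("patent", patent_corpus))  # low weight: the word appears in every document
print(bm25_idf("patent", news_corpus))    # higher weight: the word is rare in this corpus
```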
“Traditional neural network-based dense retrieval models can’t do this because they simply set weights once based on the training data,” Morris said. “We tried to develop an approach that could solve this problem.”
Contextual document embedding
The Cornell researchers propose two complementary methods to improve the performance of bi-encoders by adding the notion of context to document embeddings.
“If you think of retrieval as a ‘competition’ between documents to see which one is most relevant to a given query, we use ‘context’ to tell the encoder about the other documents that will be in the competition,” Morris said.
The first method modifies the training process of the embedding model. The researchers cluster similar documents before training, then use contrastive learning to train the encoder to distinguish between the documents within each cluster.
Contrastive learning is a technique in which a model learns to tell positive and negative examples apart. By being forced to distinguish between similar documents, the model becomes more sensitive to the subtle differences that matter in specific contexts.
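A hedged sketch of what such cluster-then-contrast training could look like with sentence-transformers is shown below; the toy data, cluster count, and loss choice are assumptions for illustration rather than the paper’s exact recipe.

```python
# Hedged sketch of cluster-then-contrast training with sentence-transformers.
# The toy data, cluster count, and loss are illustrative assumptions,
# not the paper's exact recipe.
from sklearn.cluster import KMeans
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

base = SentenceTransformer("all-MiniLM-L6-v2")

# (query, relevant document) training pairs.
pairs = [
    ("reset my password", "To reset a password, open Settings and choose Security."),
    ("change account email", "You can update your email address under Profile."),
    ("refund a purchase", "Refunds are issued within 5 business days of a request."),
    ("cancel a subscription", "Subscriptions can be cancelled from the Billing page."),
]

# Step 1: cluster the documents and order the examples by cluster, so each
# batch roughly contains documents from one cluster and the in-batch
# negatives are "hard" (similar documents).
doc_vecs = base.encode([doc for _, doc in pairs])
cluster_ids = KMeans(n_clusters=2, n_init=10).fit_predict(doc_vecs)
examples = [InputExample(texts=[q, d])
            for (q, d), _ in sorted(zip(pairs, cluster_ids), key=lambda x: x[1])]

# Step 2: contrastive training; each query's positive is its paired document,
# and the other documents in the batch act as negatives.
loader = DataLoader(examples, batch_size=2)
loss = losses.MultipleNegativesRankingLoss(base)
base.fit(train_objectives=[(loader, loss)], epochs=1, show_progress_bar=False)
```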
The second method modifies the bi-encoder architecture. The researchers add a mechanism that gives the encoder access to the corpus during the embedding process, allowing it to take the context of the document into account when generating its embedding.
The extended architecture works in two stages. First, it computes a shared embedding for the cluster to which the document belongs. It then combines this shared embedding with the document’s unique features to create a contextual embedding.
This approach allows the model to capture both the overall context of the document’s group and the specific details that make it unique. The output embedding is still the same size as a regular bi-encoder’s, so it requires no changes to the retrieval process.
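The snippet below is a highly simplified stand-in for that two-stage computation, using a mean cluster vector as the shared context and a random matrix in place of a learned projection; the actual architecture computes the context with the encoder itself.

```python
# Highly simplified stand-in for the two-stage contextual embedding: a
# shared cluster vector (here, a simple mean) is combined with the
# document's own vector and projected back to the original dimension.
# The random matrix stands in for learned weights; the real model
# computes the context with the encoder itself.
import numpy as np

def contextual_embedding(doc_vec, cluster_vecs, W):
    context = cluster_vecs.mean(axis=0)            # stage 1: shared cluster embedding
    combined = np.concatenate([context, doc_vec])  # stage 2: fuse with the document
    out = W @ combined                             # project back to the original size
    return out / np.linalg.norm(out)

dim = 384
rng = np.random.default_rng(0)
W = rng.normal(size=(dim, 2 * dim)) / np.sqrt(2 * dim)  # stand-in for a learned projection
cluster_vecs = rng.normal(size=(8, dim))                # embeddings of the document's cluster
doc_vec = rng.normal(size=dim)

print(contextual_embedding(doc_vec, cluster_vecs, W).shape)  # (384,), same size as before
```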
The impact of contextual document embedding
The researchers evaluated their method on various benchmarks and found that it consistently outperformed standard bi-encoders of comparable size, especially in out-of-domain settings where the training and test datasets differ significantly.
“Our model should be useful in any domain that differs significantly from the training data and can be thought of as a low-cost replacement for fine-tuning domain-specific embedding models,” Morris said.
Contextual embeddings can be used to improve the performance of RAG systems across a range of domains. For example, if all the documents in a corpus share the same structure or boilerplate, a standard embedding model wastes embedding space storing that redundant information.
“Contextual embeddings, on the other hand, can see from the surrounding context that this shared information isn’t useful, and discard it before deciding what exactly to keep in the embedding,” Morris said.
The researchers have released a small version of their contextual document embeddings model (cde-small-v1). It can be used as a drop-in replacement with popular open-source tools such as Hugging Face and SentenceTransformers to create custom embeddings for different applications.
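A hedged usage sketch is shown below; the repository id and loading flags are assumptions, so consult the official model card for the exact name and for the model’s two-stage (corpus context first, then documents) encoding instructions.

```python
# Hedged usage sketch: loading the released checkpoint through
# SentenceTransformers. The repository id and flags are assumptions;
# the official model card documents the exact name and the two-stage
# encoding procedure.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("jxm/cde-small-v1", trust_remote_code=True)  # assumed id

docs = [
    "Invoice #4821 is due on March 3.",
    "Support tickets are answered within 24 hours.",
]
embeddings = model.encode(docs)  # used here like a standard bi-encoder
print(embeddings.shape)
```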
Morris argues that contextual embeddings are not limited to text models and can be extended to other modalities, such as text-to-image architectures. The researchers also see room to improve them with more advanced clustering algorithms and to evaluate the technique at larger scales.