New technique makes RAG systems much better at finding the right documents

Retrieval-augmented generation (RAG) has become a popular method for grounding large language models (LLMs) in external knowledge. RAG systems typically use an embedding model to encode the documents in the knowledge corpus and retrieve the ones that are most relevant to the user’s query.

However, standard retrieval methods often fail to account for context-specific details that can be important in application-specific datasets. In a new paper, researchers from Cornell University introduce “contextual document embeddings,” a technique that improves the performance of embedding models by making them aware of the context in which documents are retrieved.


Limitations of bi-encoders

The most common approach to document retrieval in RAG is to use “bi-encoders,” where an embedding model computes a fixed representation of each document and stores it in a vector database. At inference time, the embedding of the query is computed and compared against the stored embeddings to find the most relevant documents.
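As a rough illustration (not the researchers’ code), a bare-bones bi-encoder retrieval loop with the sentence-transformers library might look like the following; the model name and the toy corpus are placeholders.

```python
# Minimal bi-encoder retrieval sketch (illustrative; model name and corpus are placeholders).
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any generic embedding model

corpus = [
    "Invoice 1042 was paid on March 3.",
    "The warranty covers parts but not labor.",
    "Support tickets are triaged within 24 hours.",
]

# Index step: encode every document once and store the fixed vectors.
doc_embeddings = model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)

# Query step: encode the query and compare it against the stored vectors.
query = "How fast are support tickets handled?"
query_embedding = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)
scores = util.cos_sim(query_embedding, doc_embeddings)[0]

best = int(scores.argmax())
print(corpus[best], float(scores[best]))
```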

Bi-encoders have become a popular choice for document retrieval in RAG systems because of their speed and scalability. However, they often struggle with nuanced, application-specific datasets because they are trained on generic data. In fact, on specialized bodies of data, they can fall behind classical statistical methods such as BM25 on some tasks.

“Our project began with the study of BM25, an old text retrieval algorithm,” John (Jack) Morris, a graduate student at Cornell Tech and co-author of the paper, told VentureBeat. “We did a little analysis and saw that the more out-of-domain the dataset is, the more BM25 outperforms neural networks.”

BM25 achieves its flexibility by calculating the weight of each word in the context of the corpus it is indexing. For example, if a word appears in many documents of the knowledge corpus, its weight will be reduced, even if it is an important keyword in other contexts. This allows BM25 to adapt to the specific characteristics of different datasets.
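For intuition, the corpus-dependent part of BM25 is its inverse document frequency (IDF) term. The snippet below is a toy illustration (not the paper’s code) of how the same word can receive very different weights depending on the corpus it is indexed against.

```python
import math

def bm25_idf(term: str, corpus: list[str]) -> float:
    """Standard BM25 IDF: terms that occur in many documents get a lower weight."""
    n = sum(1 for doc in corpus if term in doc.lower().split())
    N = len(corpus)
    return math.log((N - n + 0.5) / (n + 0.5) + 1)

legal_corpus = ["the plaintiff filed a motion", "the motion was denied", "a motion to dismiss"]
news_corpus = ["the motion of the planets", "parliament passed the budget", "rain expected tomorrow"]

# "motion" appears in every legal document, so it carries little weight there,
# but it is rare in the news corpus, so it carries much more weight.
print(bm25_idf("motion", legal_corpus))  # ~0.13, low weight
print(bm25_idf("motion", news_corpus))   # ~0.98, higher weight
```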

“Traditional neural network-based dense retrieval models can’t do this because they simply set weights once based on training data,” Morris said. “We tried to develop an approach that could solve this problem.”

Contextual document embeddings

The Cornell researchers propose two complementary methods to improve the performance of bi-encoders by adding the notion of context to document embeddings.

“If you think of search as a ‘competition’ between documents to see which one is most relevant to a given query, we use ‘context’ to tell the encoder about the other documents that will be in the competition,” Morris said.

The first method modifies the training process of the embedding model. The researchers group similar documents into clusters before training the embedding model. They then use contrastive learning to train the encoder to distinguish between the documents within each cluster.

Contrastive learning is an unsupervised technique in which a model learns to tell positive and negative examples apart. By being forced to distinguish between similar documents, the model becomes more sensitive to the subtle differences that matter in specific contexts.
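A highly simplified sketch of what cluster-aware contrastive training batches could look like is shown below. This illustrates the general idea rather than the authors’ exact recipe: the k-means clustering step and the InfoNCE-style loss are assumptions made for the example.

```python
# Illustrative sketch: build contrastive batches from clusters of similar documents.
# The k-means clustering and InfoNCE-style loss are assumptions, not the paper's exact recipe.
import torch
import torch.nn.functional as F
from sklearn.cluster import KMeans

def cluster_documents(doc_embeddings: torch.Tensor, n_clusters: int) -> torch.Tensor:
    """Group documents so each training batch can be drawn from one cluster of near-neighbors."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(doc_embeddings.numpy())
    return torch.tensor(labels)

def info_nce_loss(query_emb: torch.Tensor, doc_emb: torch.Tensor, temperature: float = 0.05):
    """Contrastive loss: each query's positive is its own document; the other
    documents in the (same-cluster) batch act as hard negatives."""
    query_emb = F.normalize(query_emb, dim=-1)
    doc_emb = F.normalize(doc_emb, dim=-1)
    logits = query_emb @ doc_emb.T / temperature
    targets = torch.arange(len(query_emb))
    return F.cross_entropy(logits, targets)
```

Because every document in a batch comes from the same cluster, the in-batch negatives are deliberately similar to the positive, which is what pushes the encoder to notice fine-grained, context-specific differences.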

The second method modifies the bi-encoder architecture. The researchers augment the encoder with a mechanism that gives it access to the corpus during the embedding process, which allows it to take the context of each document into account when generating its embedding.

The extended architecture works in two steps. First, it calculates a shared embedding for the cluster the document belongs to. It then combines this shared embedding with the document’s unique features to create a contextualized embedding.

This approach allows the model to capture both the general context of the document’s group and the specific details that make the document unique. The output embedding has the same size as that of a regular bi-encoder, so it does not require any changes to the retrieval process.
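In the spirit of the description above, a toy version of the two-step computation could look like the following. The mean-pooled cluster vector and the simple concatenate-and-project fusion are assumptions made for illustration; the paper’s actual mechanism conditions the encoder on corpus documents.

```python
import torch
import torch.nn as nn

class ContextualEmbedder(nn.Module):
    """Toy two-step embedder: combine a cluster-level context vector with
    the document's own representation, keeping the output dimension fixed."""

    def __init__(self, dim: int = 384):
        super().__init__()
        # Projection that fuses (context, document) back to the original size,
        # so downstream retrieval code does not need to change.
        self.combine = nn.Linear(2 * dim, dim)

    def forward(self, doc_emb: torch.Tensor, cluster_embs: torch.Tensor) -> torch.Tensor:
        # Step 1: a shared context vector for the cluster the document belongs to.
        context = cluster_embs.mean(dim=0)
        # Step 2: fuse the shared context with the document's unique features.
        combined = torch.cat([context.expand_as(doc_emb), doc_emb], dim=-1)
        return self.combine(combined)
```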

The impact of contextual document embedding

The researchers evaluated their method on various benchmarks and found that it consistently outperformed standard bi-encoders of comparable size, especially in out-of-domain settings where the training and test datasets differ significantly.

“Our model should be useful in any domain that differs significantly from the training data and can be thought of as a low-cost replacement for tuning domain-specific embedding models,” Morris said.

Contextual embeddings can be used to improve the performance of RAG systems across domains. For example, if all the documents share the same structure or context, a normal embedding model would waste embedding space storing that redundant structure or information.

“Contextual embeddings, on the other hand, allow you to see based on the surrounding context that the information being shared is not useful and discard it before deciding what exactly to keep in the embedding,” Morris said.

The researchers have released a small version of their contextual document embedding model (cde-small-v1). It can be used as a drop-in replacement with popular open-source tools such as HuggingFace and SentenceTransformers to create custom embeddings for different applications.
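As a rough sketch of how the released model might be plugged into the sentence-transformers library (the model ID and the need for trust_remote_code are assumptions based on typical HuggingFace usage; the contextual variant also expects corpus-level context at encoding time, so consult the model card for the exact interface):

```python
from sentence_transformers import SentenceTransformer

# Model ID is assumed; the contextual model additionally expects corpus-level
# context when encoding, so follow the model card for the exact two-stage usage.
model = SentenceTransformer("jxm/cde-small-v1", trust_remote_code=True)

docs = [
    "Refunds are processed within 5 business days.",
    "Shipping is free for orders over $50.",
]

doc_embeddings = model.encode(docs)                        # encode the corpus
query_embedding = model.encode("How long do refunds take?")
print(doc_embeddings.shape, query_embedding.shape)
```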

Morris argues that contextual embeddings are not limited to text models and can be extended to other modalities, such as text-to-image architectures. There is also room to improve them with more advanced clustering algorithms and to assess the effectiveness of the technique at larger scales.
