Beyond RAG: How cache-augmented generation reduces latency and complexity for smaller workloads

Retrieval-augmented generation (RAG) has become the de facto method for customizing large language models (LLMs) with bespoke information. However, RAG carries an upfront technical cost and can be slow. Now, thanks to advances in long-context LLMs, enterprises can bypass RAG by putting all of their proprietary information directly into the prompt.

A recent study conducted at National Chengchi University in Taiwan shows that by using long-context LLMs and caching techniques, you can build customized applications that outperform RAG pipelines. Called cache-augmented generation (CAG), this approach can be a simple and efficient alternative to RAG in enterprise settings where the knowledge corpus fits within the model's context window.


RAG Limitations

RAG is an effective method for handling open-domain questions and specialized tasks. It uses retrieval algorithms to gather documents relevant to the request and adds them as context, enabling the LLM to generate more accurate responses.

However, RAG introduces several limitations to LLM applications. The added retrieval step introduces latency that can degrade the user experience. The result also depends on the quality of the document selection and ranking step. In many cases, the limitations of the models used for retrieval require documents to be broken down into smaller chunks, which can hurt the retrieval process.

In general, RAG increases the complexity of an LLM application by requiring additional components to be developed, integrated, and maintained. The added overhead slows down the development process.

Cache-augmented generation

An alternative to building a RAG pipeline is to insert the entire document corpus into the prompt and let the model pick out the parts relevant to the request. This approach eliminates the complexity of the RAG pipeline and the problems caused by retrieval errors.

However, loading all documents into the prompt presents three key challenges. First, long prompts slow down the model and increase inference costs. Second, the length of the LLM's context window limits the number of documents that fit in the prompt. Finally, adding irrelevant information to the prompt can confuse the model and reduce the quality of its responses. Therefore, simply stuffing all of your documents into the prompt, rather than selecting the most relevant ones, can ultimately degrade the model's performance.

To overcome these challenges, the proposed CAG approach leverages three key trends.

First, advanced caching techniques make processing prompt templates faster and cheaper. The premise of CAG is that the knowledge documents will be included in every prompt sent to the model. Therefore, you can compute the attention values of their tokens in advance instead of doing so as requests arrive. This precomputation reduces the time needed to process user requests.
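
To make this concrete, here is a minimal sketch of the idea using the Hugging Face Transformers library: the knowledge prefix is run through the model once, its key-value (attention) cache is kept, and each incoming question reuses that cache. The model name, file paths, and prompt format are placeholders, and the researchers' own implementation may differ in the details.

```python
import copy

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.cache_utils import DynamicCache

# Placeholder model and documents; swap in whatever you actually use.
model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

knowledge = "\n\n".join(open(p).read() for p in ["doc1.txt", "doc2.txt"])
prefix = f"Answer questions using only the documents below.\n\n{knowledge}\n\n"

# One-time pass: run the shared knowledge prefix through the model and keep
# the resulting key-value cache.
prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids.to(model.device)
kv_cache = DynamicCache()
with torch.no_grad():
    model(input_ids=prefix_ids, past_key_values=kv_cache, use_cache=True)

def answer(question: str) -> str:
    # Per query: reuse a copy of the precomputed cache, so only the question
    # and answer tokens need fresh attention computation.
    cache = copy.deepcopy(kv_cache)
    q_ids = tokenizer(f"Question: {question}\nAnswer:", return_tensors="pt").input_ids.to(model.device)
    input_ids = torch.cat([prefix_ids, q_ids], dim=-1)
    output = model.generate(input_ids, past_key_values=cache, max_new_tokens=128)
    return tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True)

print(answer("What does the onboarding guide say about security training?"))
```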

Leading LLM providers such as OpenAI, Anthropic, and Google offer prompt caching features for the repeated portions of a prompt, which can include the knowledge documents and instructions inserted at the beginning of the prompt. According to Anthropic, caching can reduce costs by up to 90% and latency by up to 85% on the cached portions of the prompt. Equivalent caching features have also been developed for open-source LLM hosting platforms.
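
As an illustration, here is a hedged sketch of Anthropic's prompt caching API, where a cache_control marker flags the knowledge block as cacheable. The model ID, file name, and question are placeholders, and API details may change, so check the provider's documentation before relying on this pattern.

```python
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

knowledge = open("knowledge_base.txt").read()  # placeholder corpus file

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder model id
    max_tokens=512,
    system=[
        {"type": "text", "text": "Answer using only the documents provided."},
        {
            "type": "text",
            "text": knowledge,
            # Marks this block as cacheable: later calls that send the same
            # prefix can read it from the cache at lower cost and latency.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "What does the policy say about refunds?"}],
)
print(response.content[0].text)
```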

Second, long-context LLMs make it easier to fit more documents and knowledge into prompts. Claude 3.5 Sonnet supports up to 200,000 tokens, GPT-4o supports 128,000 tokens, and Gemini supports up to 2 million tokens. This makes it possible to include many documents or even entire books in the prompt.
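
Before committing to this approach, it is worth checking whether your corpus actually fits. The rough sketch below counts tokens with OpenAI's tiktoken tokenizer as a proxy; other model families tokenize differently, and the folder path and headroom factor are illustrative assumptions.

```python
from pathlib import Path

import tiktoken

# Count tokens with the o200k_base tokenizer as a rough proxy; treat the
# numbers as estimates rather than exact counts for every model.
enc = tiktoken.get_encoding("o200k_base")
corpus = "\n\n".join(p.read_text() for p in Path("docs").glob("*.txt"))  # placeholder folder
n_tokens = len(enc.encode(corpus))

# Context window sizes (tokens) for the models mentioned above.
context_limits = {"gpt-4o": 128_000, "claude-3.5-sonnet": 200_000, "gemini-1.5-pro": 2_000_000}
for model, limit in context_limits.items():
    # Leave ~20% headroom for instructions, the question, and the answer.
    verdict = "fits" if n_tokens < limit * 0.8 else "too large"
    print(f"{model}: corpus ~{n_tokens:,} tokens vs limit {limit:,} -> {verdict}")
```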

Finally, advanced training methods enable models to better retrieve, reason over, and answer questions from very long sequences. In the past year, researchers have developed several LLM benchmarks for long-sequence tasks, including BABILong, LongICLBench, and RULER. These benchmarks test LLMs on difficult problems such as multiple retrievals and multi-hop question answering. There is still much work to be done in this area, but AI labs continue to make progress.

As newer generations of models continue to expand their context windows, they will be able to process larger collections of knowledge. Moreover, we can expect models to keep improving their ability to extract and use relevant information from long contexts.

“These two trends will significantly increase the usability of our approach, enabling it to support more complex and diverse applications,” the researchers write. “As a result, our methodology is well-positioned to become a robust and versatile solution for knowledge-intensive tasks, leveraging the growing capabilities of next-generation LLMs.”

RAG vs. CAG

To compare RAG and CAG, the researchers ran experiments on two widely recognized benchmarks: SQuAD, which focuses on context-aware question answering over single documents, and HotPotQA, which requires multi-hop reasoning across multiple documents.

They used a Llama-3.1-8B model with a 128,000-token context window. For RAG, they combined the LLM with two retrieval systems to obtain passages relevant to the question: the basic BM25 algorithm and OpenAI embeddings. For CAG, they inserted multiple documents from the benchmark into the prompt and let the model itself determine which passages to use to answer the question. Their experiments show that CAG outperformed both RAG systems in most situations.
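
The sketch below is not the authors' code, but a simplified illustration of the two setups: a BM25-based RAG prompt (using the rank_bm25 package as a stand-in for the sparse retriever) versus a CAG-style prompt that simply includes every document. The documents and prompt format are placeholders.

```python
from rank_bm25 import BM25Okapi

# Placeholder knowledge passages; in the study these come from the benchmarks.
documents = [
    "Passage one of the knowledge corpus...",
    "Passage two of the knowledge corpus...",
    "Passage three of the knowledge corpus...",
]

def rag_prompt(question: str, k: int = 2) -> str:
    # RAG-style: retrieve the top-k passages with BM25, then prompt with them.
    bm25 = BM25Okapi([doc.split() for doc in documents])
    top_docs = bm25.get_top_n(question.split(), documents, n=k)
    return "Context:\n" + "\n\n".join(top_docs) + f"\n\nQuestion: {question}\nAnswer:"

def cag_prompt(question: str) -> str:
    # CAG-style: no retrieval step; the model sees the whole corpus (ideally
    # served from a prompt/KV cache) and picks out what it needs.
    return "Context:\n" + "\n\n".join(documents) + f"\n\nQuestion: {question}\nAnswer:"

print(rag_prompt("What does passage two describe?"))
print(cag_prompt("What does passage two describe?"))
```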

“By preloading all context from the test set, our system eliminates retrieval errors and provides holistic understanding of all relevant information,” the researchers write. “This advantage is particularly evident in scenarios where RAG systems might retrieve incomplete or irrelevant passages, leading to suboptimal response generation.”

CAG also significantly reduces response generation time, especially as the length of the reference text increases.

That said, CAG is not a silver bullet and should be used with caution. It is well suited to settings where the knowledge base does not change often and is small enough to fit within the model’s context window. Companies should also be careful when their documents contain conflicting facts based on their context, which could confuse the model during inference.

The best way to determine whether CAG is right for your use case is to run a few experiments. Fortunately, implementing CAG is very simple, and it should always be treated as a first step before investing in more development-intensive RAG solutions.
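
As a starting point, a rough timing harness like the one below, with your own questions and pipelines plugged into the placeholder functions, can show whether CAG's latency advantage materializes for your corpus. Comparing answer quality on the same questions matters just as much as raw latency.

```python
import statistics
import time

def benchmark(answer_fn, questions, label):
    # Time each question end to end and report the median latency.
    latencies = []
    for q in questions:
        start = time.perf_counter()
        answer_fn(q)
        latencies.append(time.perf_counter() - start)
    print(f"{label}: median latency {statistics.median(latencies):.2f}s "
          f"over {len(questions)} questions")

# Plug in your own pipelines here; these stubs are placeholders.
rag_answer = lambda q: "..."   # your retrieve-then-generate pipeline
cag_answer = lambda q: "..."   # your cached full-context pipeline

questions = ["What is our refund policy?", "Which plan includes SSO?"]  # sample queries
benchmark(rag_answer, questions, "RAG")
benchmark(cag_answer, questions, "CAG")
```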
