Beyond RAG: How cache-augmented generation reduces latency and complexity for smaller workloads

Retrieval-augmented generation (RAG) has become the de facto method for customizing large language models (LLMs) with bespoke information. However, RAG carries an upfront technical cost and can be slow. Now, thanks to advances in long-context LLMs, enterprises can bypass RAG by putting all of their proprietary information directly into the prompt.

A recent study conducted at National Chengchi University in Taiwan shows that by using long-context LLMs and caching techniques, you can build customized applications that outperform RAG pipelines. Called cache-augmented generation (CAG), this approach can be a simple and efficient alternative to RAG in enterprise settings where the knowledge corpus fits within the model's context window.


RAG Limitations

RAG is an effective method for handling open-domain questions and specialized tasks. It uses retrieval algorithms to gather documents relevant to the request and adds them as context, enabling the LLM to generate more accurate responses.

However, RAG introduces several limitations to LLM applications. The added retrieval step introduces latency that can degrade the user experience. The result also depends on the quality of the document selection and ranking step. In many cases, the limitations of the models used for retrieval require documents to be broken down into smaller chunks, which can hurt the retrieval process.

In general, RAG increases the complexity of an LLM application by requiring additional components to be developed, integrated, and maintained. The added overhead slows down the development process.

Cache-augmented generation

An alternative to building a RAG pipeline is to insert the entire document corpus into the prompt and let the model pick out the parts relevant to the request. This approach eliminates the complexity of the RAG pipeline and the problems caused by retrieval errors.

However, loading all documents into the prompt presents three key challenges. First, long prompts slow down the model and increase inference costs. Second, the length of the LLM's context window limits the number of documents that fit in the prompt. Finally, adding irrelevant information to the prompt can confuse the model and reduce the quality of its responses. Therefore, simply stuffing all of your documents into the prompt, rather than selecting the most relevant ones, can ultimately degrade the model's performance.

To overcome these challenges, the proposed CAG approach leverages three key trends.

First, advanced caching techniques make processing prompt templates faster and cheaper. The premise of CAG is that the knowledge documents will be included in every prompt sent to the model. Therefore, you can compute the attention values of their tokens in advance instead of doing so as requests arrive. This precomputation reduces the time needed to process user requests.
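
To make this concrete, here is a minimal sketch of the idea using the Hugging Face Transformers library: the knowledge prefix is run through the model once, its key-value (attention) cache is kept, and each incoming question reuses that cache. The model name, file paths, and prompt format are placeholders, and the researchers' own implementation may differ in the details.

```python
import copy

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.cache_utils import DynamicCache

# Placeholder model and documents; swap in whatever you actually use.
model_name = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

knowledge = "\n\n".join(open(p).read() for p in ["doc1.txt", "doc2.txt"])
prefix = f"Answer questions using only the documents below.\n\n{knowledge}\n\n"

# One-time pass: run the shared knowledge prefix through the model and keep
# the resulting key-value cache.
prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids.to(model.device)
kv_cache = DynamicCache()
with torch.no_grad():
    model(input_ids=prefix_ids, past_key_values=kv_cache, use_cache=True)

def answer(question: str) -> str:
    # Per query: reuse a copy of the precomputed cache, so only the question
    # and answer tokens need fresh attention computation.
    cache = copy.deepcopy(kv_cache)
    q_ids = tokenizer(f"Question: {question}\nAnswer:", return_tensors="pt").input_ids.to(model.device)
    input_ids = torch.cat([prefix_ids, q_ids], dim=-1)
    output = model.generate(input_ids, past_key_values=cache, max_new_tokens=128)
    return tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True)

print(answer("What does the onboarding guide say about security training?"))
```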

Leading LLM providers such as OpenAI, Anthropic, and Google offer prompt caching features for the repeated portions of a prompt, which can include the knowledge documents and instructions inserted at the beginning of the prompt. According to Anthropic, caching can reduce costs by up to 90% and latency by up to 85% on the cached portions of the prompt. Equivalent caching features have also been developed for open-source LLM hosting platforms.
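
As an illustration, here is a hedged sketch of Anthropic's prompt caching API, where a cache_control marker flags the knowledge block as cacheable. The model ID, file name, and question are placeholders, and API details may change, so check the provider's documentation before relying on this pattern.

```python
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

knowledge = open("knowledge_base.txt").read()  # placeholder corpus file

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # placeholder model id
    max_tokens=512,
    system=[
        {"type": "text", "text": "Answer using only the documents provided."},
        {
            "type": "text",
            "text": knowledge,
            # Marks this block as cacheable: later calls that send the same
            # prefix can read it from the cache at lower cost and latency.
            "cache_control": {"type": "ephemeral"},
        },
    ],
    messages=[{"role": "user", "content": "What does the policy say about refunds?"}],
)
print(response.content[0].text)
```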

Second, long-context LLMs make it easier to fit more documents and knowledge into prompts. Claude 3.5 Sonnet supports up to 200,000 tokens, GPT-4o supports 128,000 tokens, and Gemini supports up to 2 million tokens. This makes it possible to include many documents or even entire books in the prompt.
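
Before committing to this approach, it is worth checking whether your corpus actually fits. The rough sketch below counts tokens with OpenAI's tiktoken tokenizer as a proxy; other model families tokenize differently, and the folder path and headroom factor are illustrative assumptions.

```python
from pathlib import Path

import tiktoken

# Count tokens with the o200k_base tokenizer as a rough proxy; treat the
# numbers as estimates rather than exact counts for every model.
enc = tiktoken.get_encoding("o200k_base")
corpus = "\n\n".join(p.read_text() for p in Path("docs").glob("*.txt"))  # placeholder folder
n_tokens = len(enc.encode(corpus))

# Context window sizes (tokens) for the models mentioned above.
context_limits = {"gpt-4o": 128_000, "claude-3.5-sonnet": 200_000, "gemini-1.5-pro": 2_000_000}
for model, limit in context_limits.items():
    # Leave ~20% headroom for instructions, the question, and the answer.
    verdict = "fits" if n_tokens < limit * 0.8 else "too large"
    print(f"{model}: corpus ~{n_tokens:,} tokens vs limit {limit:,} -> {verdict}")
```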

Finally, advanced training methods enable models to better retrieve, reason over, and answer questions from very long sequences. In the past year, researchers have developed several LLM benchmarks for long-sequence tasks, including BABILong, LongICLBench, and RULER. These benchmarks test LLMs on difficult problems such as multiple retrievals and multi-hop question answering. There is still much work to be done in this area, but AI labs continue to make progress.

As newer generations of models continue to expand their context windows, they will be able to process larger collections of knowledge. Moreover, we can expect models to keep improving their ability to extract and use relevant information from long contexts.

“These two trends will significantly increase the usability of our approach, enabling it to support more complex and diverse applications,” the researchers write. “As a result, our methodology is well-positioned to become a robust and versatile solution for knowledge-intensive tasks, leveraging the growing capabilities of next-generation LLMs.”

RAG vs. CAG

To compare RAG and CAG, the researchers ran experiments on two widely recognized benchmarks: SQuAD, which focuses on context-aware question answering over single documents, and HotPotQA, which requires multi-hop reasoning across multiple documents.

They used a Llama-3.1-8B model with a 128,000-token context window. For RAG, they combined the LLM with two retrieval systems to obtain passages relevant to the question: the basic BM25 algorithm and OpenAI embeddings. For CAG, they inserted multiple documents from the benchmark into the prompt and let the model itself determine which passages to use to answer the question. Their experiments show that CAG outperformed both RAG systems in most situations.
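
The sketch below is not the authors' code, but a simplified illustration of the two setups: a BM25-based RAG prompt (using the rank_bm25 package as a stand-in for the sparse retriever) versus a CAG-style prompt that simply includes every document. The documents and prompt format are placeholders.

```python
from rank_bm25 import BM25Okapi

# Placeholder knowledge passages; in the study these come from the benchmarks.
documents = [
    "Passage one of the knowledge corpus...",
    "Passage two of the knowledge corpus...",
    "Passage three of the knowledge corpus...",
]

def rag_prompt(question: str, k: int = 2) -> str:
    # RAG-style: retrieve the top-k passages with BM25, then prompt with them.
    bm25 = BM25Okapi([doc.split() for doc in documents])
    top_docs = bm25.get_top_n(question.split(), documents, n=k)
    return "Context:\n" + "\n\n".join(top_docs) + f"\n\nQuestion: {question}\nAnswer:"

def cag_prompt(question: str) -> str:
    # CAG-style: no retrieval step; the model sees the whole corpus (ideally
    # served from a prompt/KV cache) and picks out what it needs.
    return "Context:\n" + "\n\n".join(documents) + f"\n\nQuestion: {question}\nAnswer:"

print(rag_prompt("What does passage two describe?"))
print(cag_prompt("What does passage two describe?"))
```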

“By preloading all context from the test set, our system eliminates retrieval errors and provides holistic understanding of all relevant information,” the researchers write. “This advantage is particularly evident in scenarios where RAG systems might retrieve incomplete or irrelevant passages, leading to suboptimal response generation.”

CAG also significantly reduces response generation time, especially as the length of the reference text increases.

That said, CAG is not a silver bullet and should be used with caution. It is well suited to settings where the knowledge base does not change often and is small enough to fit within the model’s context window. Companies should also be careful when their documents contain conflicting facts based on their context, which could confuse the model during inference.

The best way to determine whether CAG is right for your use case is to run a few experiments. Fortunately, implementing CAG is very simple, and it should always be treated as a first step before investing in more development-intensive RAG solutions.
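
As a starting point, a rough timing harness like the one below, with your own questions and pipelines plugged into the placeholder functions, can show whether CAG's latency advantage materializes for your corpus. Comparing answer quality on the same questions matters just as much as raw latency.

```python
import statistics
import time

def benchmark(answer_fn, questions, label):
    # Time each question end to end and report the median latency.
    latencies = []
    for q in questions:
        start = time.perf_counter()
        answer_fn(q)
        latencies.append(time.perf_counter() - start)
    print(f"{label}: median latency {statistics.median(latencies):.2f}s "
          f"over {len(questions)} questions")

# Plug in your own pipelines here; these stubs are placeholders.
rag_answer = lambda q: "..."   # your retrieve-then-generate pipeline
cag_answer = lambda q: "..."   # your cached full-context pipeline

questions = ["What is our refund policy?", "Which plan includes SSO?"]  # sample queries
benchmark(rag_answer, questions, "RAG")
benchmark(cag_answer, questions, "CAG")
```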
