DeepMind’s Michelangelo benchmark exposes the limitations of long-context LLMs

Large language models (LLMs) with very long context windows have been making headlines lately. The ability to feed hundreds of thousands, or even millions, of tokens into a single prompt opens up many possibilities for developers.

But how well do these long-context LLMs really understand and use the vast amounts of information they receive?


Researchers at Google DeepMind have presented Michelangelo, a new benchmark designed to evaluate long-context reasoning in LLMs. Their findings, published in a recent research paper, show that while current frontier models have made progress in retrieving information from large amounts of in-context data, they still struggle with tasks that require reasoning over the structure of that data.

Better long-context benchmarks are needed

The emergence of LLMs with extremely long context windows, ranging from 128,000 to over 1 million tokens, has led researchers to develop new benchmarks to judge their capabilities. However, most of the emphasis has been on retrieval tasks, such as the popular “needle in a haystack” evaluation, in which the model must find a specific piece of information within a very large context.
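For context, a needle-in-a-haystack test can be constructed very simply. The sketch below is a toy illustration under assumed details (the filler sentence, function name, and substring scoring are not from any specific benchmark):

```python
import random

def make_needle_test(needle: str, haystack_sentences: int = 5000, seed: int = 0) -> str:
    """Toy needle-in-a-haystack instance: bury one distinctive fact inside
    a long run of filler sentences and ask the model to retrieve it."""
    rng = random.Random(seed)
    filler = "The quick brown fox jumps over the lazy dog."
    sentences = [filler] * haystack_sentences
    # Place the needle at a random position inside the haystack.
    sentences.insert(rng.randrange(len(sentences) + 1), needle)
    context = " ".join(sentences)
    question = "What is the secret passphrase mentioned in the text?"
    return f"{context}\n\n{question}"

# Scoring is typically a simple check that the model's answer contains the needle.
prompt = make_needle_test("The secret passphrase is 'marble-canyon-42'.")
```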

“Over time, models have significantly improved in long-context performance,” Kiran Vodrahalli, a researcher at Google DeepMind, told VentureBeat. “For example, the popular needle-in-a-haystack retrieval evaluation is now well saturated, even at extremely long context lengths. It has therefore become essential to determine whether the harder tasks that models are capable of solving in short contexts are also solvable at long context lengths.”

Retrieval tasks do not necessarily reflect a model’s ability to reason across its context. A model may be able to find a specific fact without understanding the relationships between different parts of the text. Meanwhile, existing benchmarks that assess a model’s ability to reason over long contexts have limitations.

“It is easy to develop long-context reasoning evaluations that can be solved with a combination of retrieval and information stored in the model’s weights, which short-circuits the test of the model’s ability to use the long context,” Vodrahalli said.

Michelangelo

To address the limitations of current benchmarks, the researchers introduced Michelangelo, “a minimal, synthetic, and unleaked long-context reasoning evaluation for large language models.”

Michelangelo relies on the analogy of a sculptor chipping away irrelevant pieces of marble to reveal the underlying structure. The benchmark focuses on assessing the model’s ability to understand the relationships and structure of the information in its context window, rather than simply retrieving isolated facts.

The benchmark consists of three basic tasks:

Latent list: The model must process a long sequence of operations performed on a Python list, filter out irrelevant or redundant statements, and determine the final state of the list (see the illustrative sketch after this list). “The latent list measures a model’s ability to track the properties of a latent data structure over the course of a stream of code instructions,” the researchers write.

Multi-round co-reference resolution (MRCR): The model must reproduce parts of a long conversation between the user and the LLM. This requires the model to understand the structure of the conversation and resolve references to previous turns, even when the conversation contains confusing or distracting elements. “MRCR measures the model’s ability to understand order in natural text, distinguish similar versions of writing, and recover a specific fragment of previous context in the case of difficult queries,” the researchers write.

“I don’t know” (IDK): The model is given a long story and asked to answer multiple-choice questions about it. For some questions, the context does not contain the answer, and the model must be able to recognize the limits of its knowledge and answer “I don’t know.” “IDK measures a model’s ability to understand whether it knows what it doesn’t know, based on the context it is presented with,” the researchers write.
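To make the task structure more concrete, here is a minimal, hypothetical sketch of what a Latent List-style instance could look like. The operation mix, the distractor statements, and the scoring are illustrative assumptions, not the benchmark’s actual generator:

```python
import random

def make_latent_list_instance(num_ops: int = 20, seed: int = 0):
    """Build a toy Latent List-style prompt: a stream of Python list
    operations (some relevant, some distractors) plus the ground-truth
    final state the model is expected to report."""
    rng = random.Random(seed)
    lst = []                      # ground-truth latent state
    lines = ["l = []"]
    for _ in range(num_ops):
        op = rng.choice(["append", "pop", "print", "comment"])
        if op == "append":
            x = rng.randint(0, 9)
            lst.append(x)
            lines.append(f"l.append({x})")
        elif op == "pop" and lst:
            lst.pop()
            lines.append("l.pop()")
        elif op == "print":
            # Distractor: reads the list but never changes it.
            lines.append("print(len(l))")
        else:
            lines.append("# unrelated comment, safe to ignore")
    prompt = "\n".join(lines) + "\n# What is the final value of l?"
    return prompt, lst

prompt, expected = make_latent_list_instance()
# A model answers the prompt; the answer is scored against `expected`.
```

The model only gets the prompt; tracking the list’s state across the stream of instructions, while ignoring the distractors, is the reasoning being tested.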

Latent structure queries

The tasks in Michelangelo are built on a novel framework called latent structure queries (LSQ). LSQ provides a general approach to designing long-context reasoning evaluations that can be extended to arbitrary lengths. It also tests the model’s understanding of implicit information, as opposed to the retrieval of simple facts. LSQ relies on synthesizing test data to avoid the pitfalls of test data leaking into the training corpus.

“By requiring the model to extract information from structures rather than values from keys (sculptures from marble rather than needles from haystacks), we can more deeply test a language model’s understanding of context in a way that cannot be solved by retrieval alone,” the researchers write.

LSQ has three key differences from other approaches to evaluating long-context LLMs. First, it is explicitly designed to avoid the short-circuiting problem in evaluations that are meant to stretch beyond retrieval. Second, it specifies a methodology for increasing task complexity and context length independently. Finally, it is general enough to capture a wide range of reasoning tasks; Michelangelo’s three tasks span code interpretation and reasoning over loosely written text.
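The second property, scaling context length independently of task complexity, can be pictured with a small sketch. The padding scheme, filler sentences, and function name below are illustrative assumptions, not the paper’s actual implementation:

```python
import random

def pad_context(task_prompt: str, target_words: int, seed: int = 0) -> str:
    """Toy illustration of scaling context length independently of task
    complexity: keep the task text fixed and interleave irrelevant filler
    sentences until a rough word budget is reached."""
    rng = random.Random(seed)
    filler = [
        "The weather report mentioned light rain in the afternoon.",
        "A delivery van parked briefly outside the library.",
        "The committee rescheduled its meeting to next Tuesday.",
    ]
    lines = task_prompt.split("\n")
    per_gap = max((target_words - len(task_prompt.split())) // len(lines), 0)
    padded = []
    for line in lines:
        padded.append(line)
        gap_words = 0
        # Filler between task lines grows the context, but the reasoning
        # required to solve the task stays exactly the same.
        while gap_words < per_gap:
            sentence = rng.choice(filler)
            padded.append(sentence)
            gap_words += len(sentence.split())
    return "\n".join(padded)

short_prompt = "Alice handed the key to Bob.\nBob later gave the key to Carol.\nWho holds the key now?"
long_prompt = pad_context(short_prompt, target_words=2000)
```

Because the same fixed task can be embedded in ever-larger amounts of irrelevant context, any drop in accuracy can be attributed to context length rather than to a harder underlying problem.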

“The goal is that long-context, beyond-retrieval reasoning evaluations implemented by following LSQ will lead to fewer scenarios in which a proposed evaluation reduces to solving a retrieval task,” Vodrahalli said.

Evaluating frontier models on Michelangelo

The researchers evaluated ten frontier LLMs on Michelangelo, including different variants of Gemini, GPT-4 and GPT-4o, and Claude. They tested the models on contexts of up to 1 million tokens. The Gemini models performed best on MRCR, the GPT models stood out on Latent List, and Claude 3.5 Sonnet achieved the highest IDK scores.

However, all models showed a significant drop in performance as the complexity of the reasoning tasks increased, suggesting that even with very long context windows, current LLMs still have room to improve their ability to reason over large amounts of information.

“Frontier models have room to improve on all of the beyond-retrieval reasoning primitives (Latent List, MRCR, IDK) that we examine in Michelangelo,” Vodrahalli said. “Different frontier models have different strengths and weaknesses – each class performs well in different contexts and on different tasks. What appears to be universal across all models is an initial drop in performance on tasks requiring long-context reasoning.”

Michelangelo’s evaluations capture the basic primitives necessary for long-context reasoning, and the findings could have important implications for enterprise applications. For example, in real-world applications where the model cannot rely on its pretraining knowledge and must perform multi-hop reasoning over many different locations in very long contexts, Vodrahalli expects performance to decline as the context length grows.

“This is especially true if documents contain a lot of information that is unrelated to a specific task, making it difficult for the model to immediately distinguish what information is relevant and what is not,” Vodrahalli said. “It is also likely that models will continue to perform well on tasks where all the relevant information needed to answer a question is located in one general place in the document.”

The researchers will continue to add more evaluations to Michelangelo and hope to make them directly available so that other researchers can test their models on them.
