Researchers at Google have developed a new artificial intelligence paradigm that aims to solve one of the biggest limitations of today’s large language models: the inability to learn or update knowledge after training. The paradigm, called Nested Learning, treats the model and its training not as a single process, but as a system of nested, multi-level optimization problems. The researchers argue that this approach can unlock more expressive learning algorithms, leading to better in-context learning and better memory.
To prove their concept, the researchers used Nested Learning to develop a new model called Hope. Preliminary experiments show strong performance on language modeling, continual learning, and long-context reasoning tasks, potentially paving the way for efficient artificial intelligence systems that can adapt to real-world environments.
The memory problem of large language models
Deep learning algorithms helped eliminate the need for the careful feature design and specialized knowledge required in traditional machine learning. By feeding models huge amounts of data, they can learn the necessary representations on their own. However, this approach presented its own set of challenges that could not be solved by simply stacking more layers or creating larger networks, such as generalizing to new data, continually learning new tasks, and avoiding suboptimal solutions during training.
Efforts to overcome these challenges led to innovations such as the Transformer, now the basis of modern large language models (LLMs). These models initiated a “paradigm shift from task-based models to more general-purpose systems with a variety of emergent capabilities as a result of scaling ‘right’ architectures,” the researchers write. Still, a fundamental limitation remains: once trained, LLMs are largely static and cannot update their core knowledge or acquire new skills through new interactions.
The only flexible element of an LLM is in-context learning, the ability to perform tasks based on information provided in its immediate prompt. This makes current LLMs analogous to a person who cannot form new long-term memories. Their knowledge is limited to what they learned during initial training (the distant past) and what is in their current context window (the immediate present). Once the conversation moves beyond the context window, that information is lost forever.
The problem is that today’s Transformer-based LLMs have no “online” consolidation mechanism. The information in the context window never updates the model’s long-term parameters, the weights stored in its feed-forward layers. As a result, the model cannot durably acquire new knowledge or skills through interaction; everything it learns disappears as soon as the context window moves on.
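This limitation can be made concrete with a toy sketch (not Google’s code, and a deliberate simplification of a real LLM): at inference time the model’s parameters are frozen, so the prompt shapes one answer but leaves the weights untouched.

```python
# Toy illustration of the "no online consolidation" problem: the context
# influences a single generation, but the trained weights never change.
class FrozenLLM:
    def __init__(self):
        # Fixed after training; stands in for the feed-forward weights.
        self.weights = {"fact": "trained knowledge"}

    def generate(self, context):
        # The context is used for this one answer only.
        return f"answer using {self.weights['fact']} and context '{context}'"

model = FrozenLLM()
snapshot = dict(model.weights)
model.generate("a new fact the user just explained")
assert model.weights == snapshot  # nothing was consolidated into the weights
```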
A nested approach to learning
Nested Learning (NL) is designed to enable computational models to learn from data at different levels of abstraction and time scales, much like the brain. It treats a single machine learning model not as one continuous process, but as a system of interconnected learning problems that are optimized concurrently at different speeds. This is a departure from the classical view that treats the model architecture and its optimization algorithm as two separate components.
In this paradigm, the training process is viewed as developing “associative memory,” the ability to connect and recall related information. The model learns to map a data point to its local error, which measures how “surprising” that data point was. Even key architectural elements such as the attention mechanism in Transformers can be viewed as simple associative memory modules that learn mappings between tokens. By defining an update frequency for each component, these nested optimization problems can be organized into different “levels,” forming the core of the NL paradigm.
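A minimal sketch can show the core idea of levels defined by update frequency. The `Level` class below is hypothetical (not from the paper): each level holds one weight, measures the local “surprise” (prediction error) of a data point, and updates only every `period` steps, so fast levels track recent data while slow levels change little.

```python
import random

class Level:
    """One optimization level: updates only every `period` steps."""
    def __init__(self, period, lr):
        self.period, self.lr, self.w = period, lr, 0.0

    def surprise(self, x, target):
        # Local error: how "surprising" this data point is to the level.
        return target - self.w * x

    def maybe_update(self, step, x, target):
        if step % self.period == 0:  # slower levels update less often
            self.w += self.lr * self.surprise(x, target) * x

# Three nested levels with fast, medium, and slow update frequencies.
levels = [Level(period=1, lr=0.1),
          Level(period=8, lr=0.05),
          Level(period=64, lr=0.01)]

random.seed(0)
for step in range(1, 257):
    x = random.uniform(-1.0, 1.0)
    target = 0.5 * x  # the underlying relation all levels try to capture
    for level in levels:
        level.maybe_update(step, x, target)

# The fast level has converged near 0.5; the slow level has barely moved.
```
After 256 steps the fastest level closely approximates the true coefficient while the slowest level retains mostly its initial state, which is the intended division of labor: rapid adaptation at the fine time scale, slow consolidation at the coarse one.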
Hope for continual learning
The researchers put these principles into practice in Hope, an architecture designed to embody Nested Learning. Hope is a modified version of Titans, another architecture introduced by Google in January to address the memory limitations of the Transformer. Although Titans had a powerful memory system, its parameters updated at only two speeds: a long-term memory module and a short-term memory mechanism.
Hope is a self-modifying architecture enhanced with a “Continuum Memory System” (CMS) that allows unlimited levels of in-context learning and scaling to larger context windows. The CMS works like a series of memory banks, each updating at a different frequency. Faster-updating banks process immediate information, while slower ones consolidate more abstract knowledge over longer periods. This allows the model to optimize its own memory in a self-referential loop, creating an architecture with theoretically infinite levels of learning.
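The memory-bank idea can be sketched as follows. This is an illustrative toy, not Hope’s actual CMS: the class names and the consolidation rule (a slower bank periodically storing a chunk of the faster bank’s recent contents) are assumptions chosen to show the multi-frequency structure.

```python
class MemoryBank:
    def __init__(self, period):
        self.period = period  # this bank updates every `period` steps
        self.state = []

    def write(self, item):
        self.state.append(item)

class ContinuumMemory:
    """Toy CMS-like chain: fast banks feed summaries to slower banks."""
    def __init__(self, periods=(1, 4, 16)):
        self.banks = [MemoryBank(p) for p in periods]

    def step(self, t, token):
        self.banks[0].write(token)  # the fastest bank sees every token
        # Each slower bank periodically consolidates the faster bank's
        # recent contents into a single, more abstract chunk.
        for fast, slow in zip(self.banks, self.banks[1:]):
            if t % slow.period == 0 and fast.state:
                slow.write(tuple(fast.state[-slow.period:]))

cms = ContinuumMemory()
for t, token in enumerate(range(32), start=1):
    cms.step(t, token)

# After 32 tokens: bank 0 holds 32 raw items, bank 1 holds 8 chunks,
# bank 2 holds 2 chunks -- the same information at coarser time scales.
```
The design choice to pass summaries up the chain, rather than giving every bank the raw stream, is what lets slow banks represent longer spans of context at lower cost.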
Across a diverse set of language modeling and common-sense reasoning tasks, Hope demonstrated lower perplexity (a measure of how well the model predicts the next word in a sequence and maintains consistency in the generated text) and greater accuracy compared with both standard Transformers and other modern recurrent models. Hope also performed better on long-context “needle in a haystack” tasks, in which the model must find and use specific information hidden in a large volume of text. This suggests that the CMS offers a more efficient way to handle long sequences of information.
This is one of several attempts to create AI systems that process information at multiple levels. The Hierarchical Reasoning Model (HRM) from Sapient Intelligence uses a hierarchical architecture to make models more effective at learning reasoning tasks. The Tiny Recursive Model (TRM) from Samsung builds on HRM with architectural changes that improve its performance while increasing its efficiency.
While promising, Nested Learning faces some of the same challenges as other new paradigms in realizing its full potential. Current AI hardware and software stacks are largely optimized for classic deep learning architectures, and Transformer models in particular, so adopting Nested Learning at scale may require fundamental changes. However, if it gains momentum, it could lead to much more efficient LLMs that can continually learn, a critical capability for real-world enterprise applications where environments, data, and user needs are constantly changing.
