As enterprises continue to adopt large language models (LLMs) across a range of applications, one of the key challenges they face is improving the models' factual knowledge and reducing hallucinations. In a new paper, researchers at Meta AI propose "scalable memory layers," which could be one of several possible solutions to this problem.
Scalable memory layers add more parameters to LLMs to increase their learning capacity without requiring additional compute. The architecture is useful in applications where extra memory can be set aside for factual knowledge while preserving the inference speed of nimbler models.
Dense and memory layers
Traditional language models use "dense layers" to encode vast amounts of knowledge in their parameters. In dense layers, all parameters are used at full capacity and are mostly activated at the same time during inference. Dense layers can learn complex functions, but scaling them up requires additional computational and energy resources.
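To make the contrast concrete, here is a minimal sketch of a standard dense feed-forward block of the kind described above; the dimensions are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

# Every token is multiplied against every weight in both linear layers,
# so capacity and compute grow together as the layer is scaled up.
dense_ffn = nn.Sequential(
    nn.Linear(256, 1024),   # all 256 x 1024 weights participate for each token
    nn.GELU(),
    nn.Linear(1024, 256),
)

out = dense_ffn(torch.randn(2, 8, 256))   # (batch, sequence, hidden) -> same shape
```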
In contrast, for simple factual knowledge, much simpler layers with associative memory architectures can be more efficient and easier to interpret. This is what memory layers do: they use simple, sparse activations and key-value lookup mechanisms to encode and retrieve knowledge. Sparse layers take up more memory than dense layers, but use only a small fraction of their parameters at a time, making them much more computationally efficient.
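The following sketch illustrates the idea of a key-value memory layer with sparse, top-k lookups. The class name, slot count, and top-k value are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeyValueMemory(nn.Module):
    """Sparse key-value memory: each token reads only a few slots of a large table."""

    def __init__(self, dim: int, num_slots: int, top_k: int = 4):
        super().__init__()
        self.query_proj = nn.Linear(dim, dim)                       # maps hidden state to a query
        self.keys = nn.Parameter(torch.randn(num_slots, dim) * 0.02)
        self.values = nn.Parameter(torch.randn(num_slots, dim) * 0.02)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim)
        q = self.query_proj(x)                                      # (batch, seq, dim)
        scores = q @ self.keys.t()                                  # similarity to every key
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(top_scores, dim=-1)                     # normalize only the selected slots
        selected = self.values[top_idx]                             # (batch, seq, top_k, dim)
        # Only top_k of num_slots value vectors contribute per token,
        # so per-token compute stays small even if num_slots is very large.
        return (weights.unsqueeze(-1) * selected).sum(dim=-2)

# Example: a layer with 65,536 slots but only 4 active per token.
mem = KeyValueMemory(dim=256, num_slots=65536, top_k=4)
out = mem(torch.randn(2, 8, 256))   # (2, 8, 256)
```

The slot count can grow without changing per-token compute, since only a handful of values are read for each token; that is the memory-for-compute trade described above.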
Memory layers have existed for several years but are rarely used in modern deep learning architectures, partly because they are not optimized for current hardware accelerators.
Current frontier LLMs typically use some form of "mixture-of-experts" (MoE) architecture, which relies on a mechanism somewhat similar to memory layers. MoE models consist of many smaller expert components that specialize in specific tasks. At inference time, a routing mechanism determines which experts are activated based on the input sequence. PEER, an architecture recently developed by Google DeepMind, extends MoE to millions of experts, providing more granular control over the parameters activated during inference.
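For comparison, here is a minimal sketch of top-k expert routing in the spirit of MoE. The expert count, top-k value, and layer shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Toy mixture-of-experts layer: each token is sent to a few small expert networks."""

    def __init__(self, dim: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)       # scores each expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, dim) -- flatten batch and sequence before calling
        gate_scores, gate_idx = self.router(x).topk(self.top_k, dim=-1)
        gate = F.softmax(gate_scores, dim=-1)
        out = torch.zeros_like(x)
        # Each token is processed only by the experts the router selected for it.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = gate_idx[:, k] == e
                if mask.any():
                    out[mask] += gate[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

moe = TinyMoE(dim=128)
y = moe(torch.randn(16, 128))   # 16 tokens, each routed to 2 of 8 experts
```

The key difference from the memory layer above is what gets selected: MoE routes each token to a few small networks, while a memory layer retrieves a few value vectors from one very large table.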
Updating memory layers
Memory layers are light on compute but memory-intensive, which creates particular challenges for current hardware and software frameworks. In their paper, the Meta researchers propose several modifications that address these challenges and make large-scale use possible.
First, the researchers configured the memory layers for parallelization, spreading them across several GPUs to store tens of millions of key-value pairs without changing other layers in the model. They also implemented a special CUDA kernel for operations that require high memory bandwidth, and developed a parameter-sharing mechanism that maintains a single set of memory parameters across multiple memory layers within the model. This means that the keys and values used for lookups are shared across layers.
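A rough sketch of the parameter-sharing idea, reusing the KeyValueMemory class from the earlier sketch: several memory layers in the model point at the same key and value tensors. The block structure and residual wiring here are assumptions for illustration, not the paper's exact layout.

```python
import torch
import torch.nn as nn

# One pool of memory parameters, referenced by every memory layer in the model.
shared_memory = KeyValueMemory(dim=256, num_slots=65536, top_k=4)

class BlockWithSharedMemory(nn.Module):
    def __init__(self, dim: int, memory: KeyValueMemory):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.memory = memory          # same module object in every block -> shared keys/values

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.memory(self.norm(x))   # residual connection around the lookup

# Three blocks, but only one set of keys and values in the whole stack.
blocks = nn.ModuleList(BlockWithSharedMemory(256, shared_memory) for _ in range(3))
x = torch.randn(2, 8, 256)
for block in blocks:
    x = block(x)
```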
These modifications make it possible to implement memory layers within LLMs without slowing down the model.
“Memory layers with their sparse activations nicely complement dense networks, providing increased capacity for knowledge acquisition while providing low computational overhead,” the researchers write. “They can scale efficiently and provide practitioners with a compelling new direction in the memory-computing trade-off.”
To test memory layers, the researchers modified Llama models, replacing one or more dense layers with a shared memory layer. They compared the memory-enhanced models against dense LLMs as well as MoE and PEER models on several tasks, including factual question answering, scientific and common-sense knowledge, and coding.
Their findings show that memory models improve significantly over dense baselines and compete with models that use two to four times more compute. They also match the performance of MoE models with the same compute budget and number of parameters. The gains are especially notable on tasks that require factual knowledge. For example, on factual question answering, a 1.3-billion-parameter memory model approaches the performance of Llama-2-7B, which was trained on twice as many tokens and with 10 times as much compute.
Moreover, the researchers found that the benefits of memory models remained consistent across model sizes as they scaled their experiments from 134 million to 8 billion parameters.
"Given these findings, we strongly advocate the integration of memory layers into all next-generation AI architectures," the researchers write, while adding that there is still much room for improvement. "In particular, we hope to develop new learning methods that will further improve the effectiveness of these layers, enabling less forgetting, fewer hallucinations, and continual learning."