As enterprises continue to adopt large language models (LLMs) across a range of applications, one of the key challenges they face is improving the models' factual knowledge and reducing hallucinations. In a new paper, researchers at Meta AI propose "scalable memory layers," which could be one of several possible solutions to this problem.
Scalable memory layers add more parameters to LLMs to increase their learning capacity without requiring additional compute. The architecture is useful in applications where extra memory can be set aside for factual knowledge while preserving the inference speed of nimbler models.
Dense and memory layers
Traditional language models use "dense layers" to encode vast amounts of knowledge in their parameters. In dense layers, all parameters are used at full capacity and are mostly activated at the same time during inference. Dense layers can learn complex functions, but scaling them up requires additional computational and energy resources.
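To make the contrast concrete, here is a minimal sketch of a standard dense feed-forward block of the kind described above; the dimensions are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

# Every token is multiplied against every weight in both linear layers,
# so capacity and compute grow together as the layer is scaled up.
dense_ffn = nn.Sequential(
    nn.Linear(256, 1024),   # all 256 x 1024 weights participate for each token
    nn.GELU(),
    nn.Linear(1024, 256),
)

out = dense_ffn(torch.randn(2, 8, 256))   # (batch, sequence, hidden) -> same shape
```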
In contrast, for simple factual knowledge, much simpler layers with associative memory architectures can be more efficient and easier to interpret. This is what memory layers do: they use simple, sparse activations and key-value lookup mechanisms to encode and retrieve knowledge. Sparse layers take up more memory than dense layers, but use only a small fraction of their parameters at a time, making them much more computationally efficient.
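The following sketch illustrates the idea of a key-value memory layer with sparse, top-k lookups. The class name, slot count, and top-k value are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeyValueMemory(nn.Module):
    """Sparse key-value memory: each token reads only a few slots of a large table."""

    def __init__(self, dim: int, num_slots: int, top_k: int = 4):
        super().__init__()
        self.query_proj = nn.Linear(dim, dim)                       # maps hidden state to a query
        self.keys = nn.Parameter(torch.randn(num_slots, dim) * 0.02)
        self.values = nn.Parameter(torch.randn(num_slots, dim) * 0.02)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim)
        q = self.query_proj(x)                                      # (batch, seq, dim)
        scores = q @ self.keys.t()                                  # similarity to every key
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(top_scores, dim=-1)                     # normalize only the selected slots
        selected = self.values[top_idx]                             # (batch, seq, top_k, dim)
        # Only top_k of num_slots value vectors contribute per token,
        # so per-token compute stays small even if num_slots is very large.
        return (weights.unsqueeze(-1) * selected).sum(dim=-2)

# Example: a layer with 65,536 slots but only 4 active per token.
mem = KeyValueMemory(dim=256, num_slots=65536, top_k=4)
out = mem(torch.randn(2, 8, 256))   # (2, 8, 256)
```

The slot count can grow without changing per-token compute, since only a handful of values are read for each token; that is the memory-for-compute trade described above.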
Memory layers have existed for several years but are rarely used in modern deep learning architectures, partly because they are not optimized for current hardware accelerators.
Current frontier LLMs typically use some form of "mixture-of-experts" (MoE) architecture, which relies on a mechanism somewhat similar to memory layers. MoE models consist of many smaller expert components that specialize in specific tasks. At inference time, a routing mechanism determines which experts are activated based on the input sequence. PEER, an architecture recently developed by Google DeepMind, extends MoE to millions of experts, providing more granular control over the parameters activated during inference.
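For comparison, here is a minimal sketch of top-k expert routing in the spirit of MoE. The expert count, top-k value, and layer shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Toy mixture-of-experts layer: each token is sent to a few small expert networks."""

    def __init__(self, dim: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)       # scores each expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, dim) -- flatten batch and sequence before calling
        gate_scores, gate_idx = self.router(x).topk(self.top_k, dim=-1)
        gate = F.softmax(gate_scores, dim=-1)
        out = torch.zeros_like(x)
        # Each token is processed only by the experts the router selected for it.
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = gate_idx[:, k] == e
                if mask.any():
                    out[mask] += gate[mask, k].unsqueeze(-1) * expert(x[mask])
        return out

moe = TinyMoE(dim=128)
y = moe(torch.randn(16, 128))   # 16 tokens, each routed to 2 of 8 experts
```

The key difference from the memory layer above is what gets selected: MoE routes each token to a few small networks, while a memory layer retrieves a few value vectors from one very large table.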
Updating memory layers
Memory layers are light on compute but memory-intensive, which creates particular challenges for current hardware and software frameworks. In their paper, the Meta researchers propose several modifications that address these challenges and make large-scale use possible.
First, the researchers configured the memory layers for parallelization, spreading them across several GPUs to store tens of millions of key-value pairs without changing other layers in the model. They also implemented a special CUDA kernel for operations that require high memory bandwidth, and developed a parameter-sharing mechanism that maintains a single set of memory parameters across multiple memory layers within the model. This means that the keys and values used for lookups are shared across layers.
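A rough sketch of the parameter-sharing idea, reusing the KeyValueMemory class from the earlier sketch: several memory layers in the model point at the same key and value tensors. The block structure and residual wiring here are assumptions for illustration, not the paper's exact layout.

```python
import torch
import torch.nn as nn

# One pool of memory parameters, referenced by every memory layer in the model.
shared_memory = KeyValueMemory(dim=256, num_slots=65536, top_k=4)

class BlockWithSharedMemory(nn.Module):
    def __init__(self, dim: int, memory: KeyValueMemory):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.memory = memory          # same module object in every block -> shared keys/values

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.memory(self.norm(x))   # residual connection around the lookup

# Three blocks, but only one set of keys and values in the whole stack.
blocks = nn.ModuleList(BlockWithSharedMemory(256, shared_memory) for _ in range(3))
x = torch.randn(2, 8, 256)
for block in blocks:
    x = block(x)
```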
These modifications make it possible to implement memory layers within LLMs without slowing down the model.
“Memory layers with their sparse activations nicely complement dense networks, providing increased capacity for knowledge acquisition while providing low computational overhead,” the researchers write. “They can scale efficiently and provide practitioners with a compelling new direction in the memory-computing trade-off.”
To test memory layers, the researchers modified Llama models, replacing one or more dense layers with a shared memory layer. They compared the memory-enhanced models against dense LLMs as well as MoE and PEER models on several tasks, including factual question answering, scientific and common-sense knowledge, and coding.
Their findings show that memory models improve significantly over dense baselines and compete with models that use two to four times more compute. They also match the performance of MoE models with the same compute budget and number of parameters. The gains are especially notable on tasks that require factual knowledge. For example, on factual question answering, a 1.3-billion-parameter memory model approaches the performance of Llama-2-7B, which was trained on twice as many tokens and with 10 times as much compute.
Moreover, the researchers found that the benefits of memory models remained consistent across model sizes as they scaled their experiments from 134 million to 8 billion parameters.
"Given these findings, we strongly advocate the integration of memory layers into all next-generation AI architectures," the researchers write, while adding that there is still much room for improvement. "In particular, we hope to develop new learning methods that will further improve the effectiveness of these layers, enabling less forgetting, fewer hallucinations, and continual learning."