How Microsoft’s next-generation BitNet architecture improves LLM performance

Single-bit large language models (LLMs) have emerged as a promising approach to making generative AI more accessible and affordable. By representing model weights with a very small number of bits, 1-bit LLMs dramatically reduce the memory and computational resources required to run them.

Microsoft Research has been pushing the boundaries of 1-bit LLMs with its BitNet architecture. In a recent paper, the researchers introduce BitNet a4.8, a new technique that further improves the efficiency of 1-bit LLMs without sacrificing their performance.


The rise of 1-bit LLMs

Traditional LLMs use 16-bit floating-point numbers (FP16) to represent their parameters. This requires a large amount of memory and compute, which limits the accessibility and deployment options of LLMs. One-bit LLMs address this challenge by drastically reducing the precision of model weights while matching the performance of full-precision models.

Previous BitNet models used 1.58-bit values (-1, 0, 1) to represent model weights and 8-bit values for activations. This approach significantly reduced memory and I/O costs, but the computational cost of matrix multiplications remained a bottleneck, and optimizing neural networks with extremely low-bit parameters remains a challenge.
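
For illustration, here is a minimal sketch of the ternary (absmean) weight quantization described in the BitNet b1.58 paper; the function name and PyTorch framing are ours, not the paper's.

```python
import torch

def quantize_weights_ternary(w: torch.Tensor) -> torch.Tensor:
    # Per-tensor absmean scale, as described for BitNet b1.58.
    scale = w.abs().mean().clamp(min=1e-5)
    # Round to the nearest ternary value in {-1, 0, 1}; keep `scale`
    # around if you need to dequantize (w is approximately scale * w_q).
    return (w / scale).round().clamp(-1, 1)
```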

Two techniques help solve this problem. Sparsification reduces the number of computations by pruning activations with smaller magnitudes. This is particularly useful in LLMs because activation values tend to have a long-tailed distribution, with a few very large values and many small ones.
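
As a rough illustration of what sparsification means here, the hedged sketch below keeps only the largest-magnitude activations and zeros out the rest; the function name and default keep ratio are illustrative, not from the paper.

```python
import torch

def sparsify_topk(x: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    # Keep only the largest-magnitude activations; zero out the rest.
    k = max(1, int(x.numel() * keep_ratio))
    # kthvalue returns the k-th *smallest* element, so index from the top.
    threshold = x.abs().flatten().kthvalue(x.numel() - k + 1).values
    return torch.where(x.abs() >= threshold, x, torch.zeros_like(x))
```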

Quantization, on the other hand, uses fewer bits to represent activations, which reduces the computational and memory costs of processing them. However, simply lowering activation precision can result in significant quantization errors and performance degradation.
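
A common form of activation quantization is symmetric absmax rounding into a low-bit integer range. The sketch below shows a generic 4-bit version of this idea; it is not the paper's exact quantizer.

```python
import torch

def quantize_activations_int4(x: torch.Tensor):
    # Symmetric absmax quantization into the signed 4-bit range [-8, 7].
    scale = x.abs().max().clamp(min=1e-5) / 7.0
    x_q = (x / scale).round().clamp(-8, 7)
    return x_q, scale  # dequantize with x_q * scale
```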

Moreover, combining sparsification and quantization is difficult and poses particular problems when training 1-bit LLMs.

“Both quantization and sparsification introduce non-differentiable operations, making gradient computation during training particularly challenging,” Furu Wei, partner research manager at Microsoft Research, told VentureBeat.

Gradient computation is essential for calculating errors and updating parameters when training neural networks. The researchers also needed to ensure that their techniques could be implemented efficiently on existing hardware while retaining the benefits of sparsification and quantization.
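
A standard workaround for non-differentiable operations like rounding is the straight-through estimator (STE), which earlier BitNet work uses during training: quantize in the forward pass, but let gradients flow through as if the operation were the identity. A minimal PyTorch sketch:

```python
import torch

class RoundSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.round()  # non-differentiable rounding in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output  # treat rounding as the identity for gradients

# usage: y = RoundSTE.apply(x) behaves like rounding but stays trainable
```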

BitNet a4.8

BitNet a4.8 addresses the challenges of optimizing 1-bit LLMs through what the researchers call “hybrid quantization and sparsification.” They achieved this by designing an architecture that selectively applies quantization or sparsification to different components of the model based on the distribution pattern of their activations. The architecture uses 4-bit activations for inputs to the attention and feed-forward network (FFN) layers. It uses sparsification with 8 bits for intermediate states, retaining only the top 55% of the parameters. The architecture is also optimized to take advantage of existing hardware.
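
Combining the earlier sketches, the hedged pseudocode below illustrates how such a hybrid scheme could route an FFN layer: 4-bit quantization on the layer input, top-55% sparsification on the intermediate state. It reuses quantize_activations_int4 and sparsify_topk from the sketches above and simplifies the actual architecture, which among other things keeps the surviving intermediate values at 8-bit precision.

```python
import torch

def ffn_forward(x, w_in, w_out):
    # 4-bit quantization on the layer input (helper from the sketch above).
    x_q, scale = quantize_activations_int4(x)
    hidden = torch.relu((x_q * scale) @ w_in)  # intermediate state
    # Sparsify the intermediate state, keeping the top 55% by magnitude.
    hidden = sparsify_topk(hidden, keep_ratio=0.55)
    return hidden @ w_out
```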

“In BitNet b1.58, the bottleneck of 1-bit LLM inference switches from memory/IO to computation, which is constrained by the activation bits (i.e., 8-bit in BitNet b1.58),” Wei said. “In BitNet a4.8, we push the activation bits to 4-bit, so we are able to leverage 4-bit kernels (e.g., INT4/FP4) to bring a 2x speedup for LLM inference on GPU devices. The combination of 1-bit model weights from BitNet b1.58 and 4-bit activations from BitNet a4.8 effectively addresses both memory/IO and computational constraints in LLM inference.”

BitNet a4.8 also uses 3-bit values to represent the key (K) and value (V) states in the attention mechanism. The KV cache is a crucial component of transformer models that stores the representations of previous tokens in the sequence. By lowering the precision of KV cache values, BitNet a4.8 further reduces memory requirements, especially for long sequences.
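
Applied to the KV cache, the same absmax idea compresses each token's key and value vectors into a signed 3-bit range. The sketch below is a generic illustration, not the paper's exact quantizer.

```python
import torch

def quantize_kv_int3(kv: torch.Tensor):
    # Symmetric absmax quantization into the signed 3-bit range [-4, 3],
    # scaled per token so one outlier doesn't dominate the whole cache.
    scale = kv.abs().amax(dim=-1, keepdim=True).clamp(min=1e-5) / 3.0
    kv_q = (kv / scale).round().clamp(-4, 3)
    return kv_q, scale  # dequantize with kv_q * scale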

The promise of BitNet a4.8

Experimental results show that BitNet a4.8 provides comparable performance to its predecessor BitNet b1.58 while consuming less processing power and memory.

Compared to full-precision Llama models, BitNet a4.8 reduces memory usage by a factor of 10 and achieves a 4x speedup. Compared to BitNet b1.58, it achieves a 2x speedup thanks to 4-bit activation kernels. But the design can deliver much more.

“The estimated computational improvement is based on existing hardware (GPU),” Wei said. “With hardware specifically optimized for 1-bit LLMs, the computational improvements can be significantly enhanced. BitNet introduces a new computational paradigm that minimizes the need for matrix multiplication, which is a primary focus of current hardware design optimization.”

The efficiency of BitNet a4.8 makes it particularly suitable for deploying LLMs at the edge and on resource-constrained devices. This can have significant privacy and security implications: by running LLMs on-device, users can benefit from the capabilities of these models without having to send their data to the cloud.

Wei and his team continue their work on 1-bit LLMs.

“We continue to advance our research and vision for the 1-bit LLM era,” Wei said. “While our current focus is on model architecture and software support (i.e. bitnet.cpp), our goal is to explore the co-design and evolution of model and hardware architecture to fully unlock the potential of 1-bit LLMs.”
