How Microsoft’s next-generation BitNet architecture improves LLM performance

Single-bit large language models (LLMs) have emerged as a promising approach to making generative AI more accessible and affordable. By representing model weights with a very small number of bits, 1-bit LLMs dramatically reduce the memory and computational resources required to run them.

Microsoft Research has been pushing the boundaries of 1-bit LLMs with its BitNet architecture. In a recent paper, the researchers introduce BitNet a4.8, a new technique that further improves the efficiency of 1-bit LLMs without sacrificing their performance.


The rise of 1-bit LLMs

Traditional LLMs use 16-bit floating-point numbers (FP16) to represent their parameters. This requires a large amount of memory and compute, which limits the accessibility and deployment options of LLMs. One-bit LLMs address this challenge by drastically reducing the precision of model weights while matching the performance of full-precision models.

Previous BitNet models used 1.58-bit values (-1, 0, 1) to represent model weights and 8-bit values for activations. This approach significantly reduced memory and I/O costs, but the computational cost of matrix multiplications remained a bottleneck, and optimizing neural networks with extremely low-bit parameters remains a challenge.
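
For illustration, here is a minimal sketch of the ternary (absmean) weight quantization described in the BitNet b1.58 paper; the function name and PyTorch framing are ours, not the paper's.

```python
import torch

def quantize_weights_ternary(w: torch.Tensor) -> torch.Tensor:
    # Per-tensor absmean scale, as described for BitNet b1.58.
    scale = w.abs().mean().clamp(min=1e-5)
    # Round to the nearest ternary value in {-1, 0, 1}; keep `scale`
    # around if you need to dequantize (w is approximately scale * w_q).
    return (w / scale).round().clamp(-1, 1)
```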

Two techniques help solve this problem. Sparsification reduces the number of computations by pruning activations with smaller magnitudes. This is particularly useful in LLMs because activation values tend to have a long-tailed distribution, with a few very large values and many small ones.
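
As a rough illustration of what sparsification means here, the hedged sketch below keeps only the largest-magnitude activations and zeros out the rest; the function name and default keep ratio are illustrative, not from the paper.

```python
import torch

def sparsify_topk(x: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    # Keep only the largest-magnitude activations; zero out the rest.
    k = max(1, int(x.numel() * keep_ratio))
    # kthvalue returns the k-th *smallest* element, so index from the top.
    threshold = x.abs().flatten().kthvalue(x.numel() - k + 1).values
    return torch.where(x.abs() >= threshold, x, torch.zeros_like(x))
```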

Quantization, on the other hand, uses fewer bits to represent activations, which reduces the computational and memory costs of processing them. However, simply lowering activation precision can result in significant quantization errors and performance degradation.
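
A common form of activation quantization is symmetric absmax rounding into a low-bit integer range. The sketch below shows a generic 4-bit version of this idea; it is not the paper's exact quantizer.

```python
import torch

def quantize_activations_int4(x: torch.Tensor):
    # Symmetric absmax quantization into the signed 4-bit range [-8, 7].
    scale = x.abs().max().clamp(min=1e-5) / 7.0
    x_q = (x / scale).round().clamp(-8, 7)
    return x_q, scale  # dequantize with x_q * scale
```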

Moreover, combining sparsification and quantization is difficult and poses particular problems when training 1-bit LLMs.

“Both quantization and sparsification introduce non-differentiable operations, making gradient computation during training particularly challenging,” Furu Wei, partner research manager at Microsoft Research, told VentureBeat.

Gradient computation is essential for calculating errors and updating parameters when training neural networks. The researchers also needed to ensure that their techniques could be implemented efficiently on existing hardware while retaining the benefits of sparsification and quantization.
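
A standard workaround for non-differentiable operations like rounding is the straight-through estimator (STE), which earlier BitNet work uses during training: quantize in the forward pass, but let gradients flow through as if the operation were the identity. A minimal PyTorch sketch:

```python
import torch

class RoundSTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.round()  # non-differentiable rounding in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output  # treat rounding as the identity for gradients

# usage: y = RoundSTE.apply(x) behaves like rounding but stays trainable
```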

BitNet a4.8

BitNet a4.8 addresses the challenges of optimizing 1-bit LLMs through what the researchers call “hybrid quantization and sparsification.” They achieved this by designing an architecture that selectively applies quantization or sparsification to different components of the model based on the distribution pattern of their activations. The architecture uses 4-bit activations for inputs to the attention and feed-forward network (FFN) layers. It uses sparsification with 8 bits for intermediate states, retaining only the top 55% of the parameters. The architecture is also optimized to take advantage of existing hardware.
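
Combining the earlier sketches, the hedged pseudocode below illustrates how such a hybrid scheme could route an FFN layer: 4-bit quantization on the layer input, top-55% sparsification on the intermediate state. It reuses quantize_activations_int4 and sparsify_topk from the sketches above and simplifies the actual architecture, which among other things keeps the surviving intermediate values at 8-bit precision.

```python
import torch

def ffn_forward(x, w_in, w_out):
    # 4-bit quantization on the layer input (helper from the sketch above).
    x_q, scale = quantize_activations_int4(x)
    hidden = torch.relu((x_q * scale) @ w_in)  # intermediate state
    # Sparsify the intermediate state, keeping the top 55% by magnitude.
    hidden = sparsify_topk(hidden, keep_ratio=0.55)
    return hidden @ w_out
```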

“In BitNet b1.58, the bottleneck of 1-bit LLM inference switches from memory/IO to computation, which is constrained by the activation bits (i.e., 8-bit in BitNet b1.58),” Wei said. “In BitNet a4.8, we push the activation bits to 4-bit, so we are able to leverage 4-bit kernels (e.g., INT4/FP4) to bring a 2x speedup for LLM inference on GPU devices. The combination of 1-bit model weights from BitNet b1.58 and 4-bit activations from BitNet a4.8 effectively addresses both memory/IO and computational constraints in LLM inference.”

BitNet a4.8 also uses 3-bit values to represent the key (K) and value (V) states in the attention mechanism. The KV cache is a crucial component of transformer models that stores the representations of previous tokens in the sequence. By lowering the precision of KV cache values, BitNet a4.8 further reduces memory requirements, especially for long sequences.
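
Applied to the KV cache, the same absmax idea compresses each token's key and value vectors into a signed 3-bit range. The sketch below is a generic illustration, not the paper's exact quantizer.

```python
import torch

def quantize_kv_int3(kv: torch.Tensor):
    # Symmetric absmax quantization into the signed 3-bit range [-4, 3],
    # scaled per token so one outlier doesn't dominate the whole cache.
    scale = kv.abs().amax(dim=-1, keepdim=True).clamp(min=1e-5) / 3.0
    kv_q = (kv / scale).round().clamp(-4, 3)
    return kv_q, scale  # dequantize with kv_q * scale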

The promise of BitNet a4.8

Experimental results show that BitNet a4.8 provides comparable performance to its predecessor BitNet b1.58 while consuming less processing power and memory.

Compared to full-precision Llama models, BitNet a4.8 reduces memory usage by a factor of 10 and achieves a 4x speedup. Compared to BitNet b1.58, it achieves a 2x speedup thanks to 4-bit activation kernels. But the design can deliver much more.

“The estimated computational improvement is based on existing hardware (GPU),” Wei said. “With hardware specifically optimized for 1-bit LLMs, the computational improvements can be significantly enhanced. BitNet introduces a new computational paradigm that minimizes the need for matrix multiplication, which is a primary focus of current hardware design optimization.”

The efficiency of BitNet a4.8 makes it particularly suitable for deploying LLMs at the edge and on resource-constrained devices. This can have significant privacy and security implications: by running LLMs on-device, users can benefit from the capabilities of these models without having to send their data to the cloud.

Wei and his team continue their work on 1-bit LLMs.

“We continue to advance our research and vision for the 1-bit LLM era,” Wei said. “While our current focus is on model architecture and software support (i.e. bitnet.cpp), our goal is to explore the co-design and evolution of model and hardware architecture to fully unlock the potential of 1-bit LLMs.”
