Huawei's computing research lab in Zurich has introduced a new open-source quantization approach for large language models (LLMs) that reduces memory requirements without sacrificing output quality.
The technique, called SINQ (Sinkhorn-Normalized Quantization), is designed to be fast, calibration-free, and easy to integrate into existing model workflows. The Huawei research team has made the code available on GitHub and Hugging Face under the permissive, enterprise-friendly Apache 2.0 license, allowing organizations to take it, modify it, and deploy it commercially for free.
Across models of various sizes, SINQ cuts memory usage by 60–70%, depending on architecture and bit width.
This makes it possible to run models that previously needed more than 60 GB of memory on setups with roughly 20 GB, a critical enabler for running large models on a single high-end GPU or even on multi-GPU consumer-grade rigs.
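As a back-of-the-envelope check of those figures (an illustrative calculation, not one from the SINQ paper), the weights-only footprint of a model is simply parameter count times bits per weight, so going from 16-bit to 4-bit cuts it by 4x; real usage is higher once activations and KV cache are counted, which is why a ~16 GB weight budget lands near the ~20 GB setups mentioned above:

```python
def weights_memory_gb(n_params_billions, bits_per_weight):
    """Rough weights-only footprint: parameters x bits / 8, in GB.
    Real memory use is higher (KV cache, activations, runtime overhead)."""
    return n_params_billions * 1e9 * bits_per_weight / 8 / 1e9

# A 32B-parameter model (illustrative size) at 16-bit vs. 4-bit weights:
fp16_gb = weights_memory_gb(32, 16)  # 64.0 GB -- beyond any single consumer card
int4_gb = weights_memory_gb(32, 4)   # 16.0 GB -- fits on a 24 GB consumer GPU
```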
In hardware terms, models that once required enterprise GPUs like NVIDIA's A100 or H100 can now run on far more affordable options, such as a single Nvidia GeForce RTX 4090 (around $1,600), instead of enterprise hardware like the A100 80 GB (roughly $19,000) or the H100, which can exceed $30,000.
For teams using cloud infrastructure, the savings are similarly tangible. A100-based instances often cost $3–4.50 per hour, while 24 GB GPUs such as the RTX 4090 are available on many platforms for $1–1.50 per hour.
Over time, especially for sustained inference workloads, that difference can add up to thousands of dollars in cost reductions, while also unlocking LLM deployment on smaller clusters, local workstations, or consumer-grade setups that were previously constrained by memory.
Tackling the memory challenge of LLMs
Running large models typically forces trade-offs between size and performance.
In practice, neural networks use floating-point numbers to represent both weights and activations. A floating-point number can express a wide range of values (very small, very large, with fractional parts).
This flexibility is helpful because, during training and inference, weights and activations can vary significantly in scale, and floating point captures that range precisely. (For example, a weight might be 0.0023 or 123.45, and floating point can represent each with decent precision.)
Quantization, a method that reduces the precision of model weights, offers a practical path to lower memory consumption, but it often comes with quality trade-offs, especially at 4-bit precision and below.
Converting these floating-point values to lower-precision formats (such as 8-bit integers) approximates them.
That means storing and computing with fewer bits, which is faster and more memory-efficient, but it risks losing fidelity (i.e., introducing small errors).
The trick is to convert carefully so that the model's behavior stays nearly the same even though, internally, it is working with coarser approximations of its weights and activations.
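The simplest such conversion, round-to-nearest (RTN) absmax quantization, can be sketched in a few lines of plain Python (an illustration of the baseline, not SINQ's code). Note how the outlier 123.45 stretches the scale so far that the tiny weight 0.0023 collapses to zero, exactly the kind of error that motivates more careful schemes:

```python
def quantize_rtn(weights, bits=8):
    """Round-to-nearest absmax quantization: one float scale per tensor,
    weights mapped onto a symmetric integer grid [-qmax, qmax]."""
    qmax = 2 ** (bits - 1) - 1            # 127 for 8-bit signed
    scale = max(abs(w) for w in weights) / qmax
    return scale, [round(w / scale) for w in weights]

def dequantize(scale, q):
    return [scale * x for x in q]

weights = [0.0023, 123.45, -0.5, 7.8]
scale, q = quantize_rtn(weights)
restored = dequantize(scale, q)
# q[0] is 0: the outlier 123.45 sets the scale, wiping out the small weight.
```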
SINQ addresses these pain points with a plug-and-play solution that delivers strong performance even at low precision, without requiring calibration data or inter-layer dependencies.
How SINQ works
The SINQ approach introduces two key innovations:
- Dual-axis scaling: Instead of using a single scale factor to quantize a matrix, SINQ uses separate scaling vectors for rows and columns. This helps mitigate the effect of outlier values and allows quantization error to be distributed more flexibly across the matrix.
- Sinkhorn-Knopp-style normalization: A fast algorithm inspired by Sinkhorn iterations is used to normalize the standard deviations of a matrix's rows and columns. This helps minimize what the authors call "matrix imbalance," a new proxy metric they show to be more effective than alternatives such as kurtosis for improving quantization performance.
The combination of these two features allows SINQ to outperform other calibration-free quantization techniques, such as Round-To-Nearest (RTN), HQQ, and Hadamard-based quantization, on multiple benchmarks.
Performance and compatibility
SINQ has been evaluated across a wide range of architectures and models, including the Qwen3 series, LLaMA, and DeepSeek.
On benchmarks such as WikiText2 and C4, SINQ consistently reduces perplexity and flip rate compared with baseline methods, often approaching the performance of calibrated solutions.
It also supports non-uniform quantization schemes such as NF4 and can be combined with calibration methods such as AWQ, yielding the A-SINQ variant. In calibrated settings, A-SINQ further narrows the gap with full-precision models.
In terms of runtime, SINQ quantizes models about twice as fast as HQQ and more than 30 times faster than AWQ, making it well suited for both research and production environments where quantization time is a practical constraint.
Open source and easy to use
Huawei has released SINQ as an open-source project under the enterprise-friendly Apache 2.0 license, with implementation instructions and reproduction tools available on GitHub.
The repository includes support for quantizing Hugging Face models with just a few lines of code, along with utilities for saving and reloading quantized weights. The default settings offer a balance between memory savings and accuracy, and users can tune parameters such as bit width, tiling strategy, and group size to their needs.
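To illustrate what a knob like group size controls (a generic sketch of group-wise quantization, not the repository's actual API), smaller groups store more scales but track local value ranges better, trading a little extra memory for accuracy:

```python
def quantize_grouped(weights, bits=4, group_size=4):
    """Generic group-wise quantization sketch (not SINQ's actual API):
    each group of `group_size` weights gets its own absmax scale, so a
    large value in one group cannot wipe out small values elsewhere."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for 4-bit signed
    groups = []
    for start in range(0, len(weights), group_size):
        chunk = weights[start:start + group_size]
        scale = (max(abs(w) for w in chunk) / qmax) or 1.0
        groups.append((scale, [round(w / scale) for w in chunk]))
    return groups

def dequantize_grouped(groups):
    return [scale * q for scale, qs in groups for q in qs]

weights = [0.1, -0.5, 0.25, 0.05, 3.0, -2.0, 1.0, 0.5]
restored = dequantize_grouped(quantize_grouped(weights))
# Per-element error stays within half of each group's own scale,
# instead of being dominated by the global maximum (3.0).
```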
The authors also provide evaluation integration via the lm-eval library and plan to release pre-quantized models on the Hugging Face Hub in the near future.
Looking ahead
As demand grows for running large models on consumer-grade hardware, quantization is becoming an indispensable tool. SINQ aims to lower the barrier to entry for LLM deployment, letting developers and researchers shrink models efficiently without major compromises in quality or compatibility.
Further updates, including integration with Hugging Face Transformers and pre-quantized model releases, are planned, making this a project to watch in the quantization space.
