Test-time scaling unlocks hidden reasoning abilities in small language models (and allows them to surpass LLMs)

Very small language models (SLMs) can outperform leading large language models (LLMs) on reasoning tasks, according to a new study by Shanghai AI Laboratory. The authors show that with the right tools and test-time scaling techniques, an SLM with 1 billion parameters can outperform a 405B-parameter LLM on complicated math benchmarks.

The ability to deploy SLMs for complex reasoning tasks can be very useful as enterprises look for new ways to use these models in different environments and applications.

Test-time scaling, explained

Test-time scaling (TTS) is the process of giving LLMs extra compute cycles during inference to improve their performance on various tasks. Leading reasoning models, such as OpenAI o1 and DeepSeek-R1, use "internal TTS," which means they are trained to "think" slowly by generating a long string of chain-of-thought (CoT) tokens.

An alternative approach is "external TTS," where (as the name implies) model performance is enhanced with outside help. External TTS is suitable for repurposing existing models for reasoning tasks without further fine-tuning them. An external TTS setup usually consists of a "policy model," which is the main LLM generating the answer, and a process reward model (PRM) that evaluates the policy model's answers. These two components are coupled together through a sampling or search method.

The simplest setup is "best-of-N," where the policy model generates multiple answers and the PRM selects one or more of the best answers to compose the final response. More advanced external TTS methods use search. In "beam search," the model breaks the answer down into multiple steps.

For each step, it samples multiple answers and runs them through the PRM. It then chooses one or more suitable candidates and generates the next step of the answer. In "diverse verifier tree search" (DVTS), the model generates several branches of answers to create a more diverse set of candidate responses before composing them into a final answer.
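The two simpler strategies can be sketched in a few lines. This is a minimal illustration, not the study's implementation: `policy` and `prm` are hypothetical stand-ins for the policy LLM and the process reward model.

```python
import random

# Hypothetical stand-ins. In a real external-TTS setup, `policy` would call
# the policy LLM to extend an answer, and `prm` would score a (partial)
# answer with the process reward model.
def policy(prompt, step):
    return f"{prompt} step{step}:{random.randint(0, 9)}"

def prm(candidate):
    return random.random()  # PRM score for the candidate answer

def best_of_n(prompt, n=8):
    """Best-of-N: sample n full answers, keep the highest-scored one."""
    candidates = [policy(prompt, step=0) for _ in range(n)]
    return max(candidates, key=prm)

def beam_search(prompt, beam_width=2, samples_per_beam=4, steps=3):
    """Beam search: at each step, expand every beam with several sampled
    continuations, score them with the PRM, and keep only the best beams."""
    beams = [prompt]
    for step in range(steps):
        expansions = [policy(b, step) for b in beams
                      for _ in range(samples_per_beam)]
        expansions.sort(key=prm, reverse=True)
        beams = expansions[:beam_width]
    return beams[0]
```

DVTS follows the same pattern but runs several independent search subtrees in parallel to keep the candidate pool diverse before the final answer is selected.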

What is the correct scaling strategy?

Choosing the right TTS strategy depends on several factors. The study authors carried out a systematic investigation of how different policy models and PRMs affect the efficiency of TTS methods.

Their findings show that efficiency depends heavily on the policy model and the PRM. For example, for small policy models, search-based methods outperform best-of-N. However, for large policy models, best-of-N is more effective, because the models have better reasoning capabilities and don't need a reward model to verify every step of their reasoning.

Their findings also show that the right TTS strategy depends on the difficulty of the problem. For example, for small policy models with fewer than 7B parameters, best-of-N works better on easy problems, while beam search works better on harder problems. For policy models between 7B and 32B parameters, diverse tree search performs well on easy and medium problems, and beam search works best on hard problems. But for large policy models (72B parameters and above), best-of-N is the optimal method across all difficulty levels.
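The difficulty-dependent findings above amount to a small decision table. The sketch below paraphrases them as a selection function; the thresholds and method names come from the study's reported results, while the function itself is only an illustration.

```python
def choose_tts_method(policy_params_b: float, difficulty: str) -> str:
    """Pick a TTS strategy from the policy model size (in billions of
    parameters) and the problem difficulty ('easy', 'medium', 'hard'),
    following the study's reported findings."""
    if policy_params_b < 7:
        # Small policy models: best-of-N on easy problems, beam search otherwise
        return "best-of-N" if difficulty == "easy" else "beam search"
    if policy_params_b < 72:
        # Mid-sized policy models (7B-32B): DVTS for easy/medium, beam search for hard
        return "beam search" if difficulty == "hard" else "DVTS"
    # Large policy models (72B+): best-of-N at all difficulty levels
    return "best-of-N"
```

For example, `choose_tts_method(3, "hard")` returns `"beam search"`, while `choose_tts_method(72, "hard")` returns `"best-of-N"`.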

Why small models can outperform large models

Based on these findings, developers can create compute-optimal TTS strategies that take into account the policy model, the PRM and the problem difficulty to make the best use of the compute budget when solving reasoning problems.

For example, the researchers found that a Llama-3.2-3B model with the compute-optimal TTS strategy outperforms Llama-3.1-405B on MATH-500 and AIME24, two complicated math benchmarks. This shows that an SLM can outperform a model 135x larger when using the compute-optimal TTS strategy.

In other experiments, they found that a Qwen2.5 model with 500 million parameters can outperform GPT-4o with the right compute-optimal TTS strategy. Using the same strategy, the 1.5B distilled version of DeepSeek-R1 outperformed o1-preview and o1-mini on MATH-500 and AIME24.

When accounting for both training and inference compute budgets, the findings show that with compute-optimal scaling strategies, SLMs can outperform larger models with 100-1000x fewer FLOPS.
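A back-of-the-envelope calculation shows why the gap is that large once training compute is included. This sketch uses the common approximations of ~6N FLOPs per training token and ~2N FLOPs per generated token for an N-parameter model; the training-token counts (~9T for Llama-3.2-3B, ~15T for Llama-3.1-405B) are publicly reported figures used here as assumptions, not numbers taken from the study itself.

```python
def total_flops(params: float, train_tokens: float, gen_tokens: float) -> float:
    """Rough total compute: ~6*N FLOPs per training token plus
    ~2*N FLOPs per generated token for an N-parameter model."""
    return 6 * params * train_tokens + 2 * params * gen_tokens

# SLM answers with TTS: 64 sampled responses of ~512 tokens each.
slm = total_flops(3e9, 9e12, 64 * 512)
# LLM produces a single ~512-token answer.
llm = total_flops(405e9, 15e12, 512)

print(f"LLM uses ~{llm / slm:.0f}x more total FLOPs")
```

With these assumed figures the ratio lands in the low hundreds; training compute dominates, which is why the study's 100-1000x range only emerges when training budgets are counted alongside inference.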

The researchers' results show that compute-optimal TTS significantly enhances the reasoning capabilities of language models. However, as the policy model grows larger, the improvement from TTS gradually decreases.

"This suggests that the effectiveness of TTS is directly related to the reasoning ability of the policy model," the researchers write. "Specifically, for models with weak reasoning abilities, scaling test-time compute leads to a substantial improvement, whereas for models with strong reasoning abilities, the gain is limited."

The study validates that SLMs can perform better than larger models when applying compute-optimal test-time scaling methods. While this study focuses on math benchmarks, the researchers plan to expand their study to other reasoning tasks, such as coding and chemistry.
