In a recent case study, Hugging Face researchers demonstrated how small language models (SLMs) can be configured to outperform much larger models. Their findings show that the Llama 3 model with 3B parameters can outperform the 70B version of the same model family on complex math problems.
Hugging Face has fully documented the entire process and provides a roadmap for businesses trying to create their own tailored reasoning models.
Scaling compute at test time
The work is inspired by OpenAI o1, which uses additional "thinking" to solve complex math, coding and reasoning problems.
The key idea behind models like o1 is to scale "test-time compute," which in practice means using more compute cycles during inference to try, check and confirm different answers and reasoning paths before arriving at a final answer. Scaling compute at test time is especially useful when there is not enough memory to run a large model.
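As a rough back-of-the-envelope illustration of that trade-off (the "about 2 × parameters FLOPs per generated token" approximation and the sample counts below are generic assumptions, not figures from the study):

```python
def inference_flops(params_billion, tokens, samples=1):
    """Approximate forward-pass cost: ~2 * parameters FLOPs per generated token."""
    return 2 * params_billion * 1e9 * tokens * samples

# One pass of a 70B model vs. 64 sampled solutions from a 3B model,
# each solution ~500 tokens long (illustrative numbers only).
big = inference_flops(70, tokens=500)
small = inference_flops(3, tokens=500, samples=64)
print(f"70B single pass: {big:.2e} FLOPs, 3B x 64 samples: {small:.2e} FLOPs")
# The 64-sample 3B run costs roughly 3x the FLOPs of one 70B pass,
# but it needs far less memory at any given moment.
```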
Because o1 is a private model and OpenAI is silent about its inner workings, researchers are speculating about how it really works and attempting to reverse engineer the process. There are already several open alternatives to o1.
Hugging Face’s work builds on a DeepMind study published in August that examined the trade-offs between inference-time and pre-training computation. The study provides comprehensive guidelines for balancing training and inference compute to achieve the best results within a fixed budget.
Beyond spending extra compute at inference time, the success of this method depends on two key elements: a reward model that evaluates the SLM’s responses, and a search algorithm that optimizes the path taken to refine the response.
Various reasoning algorithms
The simplest way to use test-time scaling is “majority voting,” where the same prompt is sent to the model multiple times and the most frequent answer is chosen. Majority voting can be useful on easy problems, but its effectiveness quickly plateaus on complex reasoning problems or tasks where errors are consistent across generations.
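A minimal sketch of majority voting, assuming a hypothetical `generate` callable that queries the SLM once (with non-zero temperature) and returns its final answer as a string:

```python
from collections import Counter

def majority_vote(prompt, generate, n_samples=16):
    """Sample the model several times and return the most frequent final answer.

    `generate` is a stand-in for whatever function calls the SLM and extracts
    its final answer (e.g. the boxed result of a math problem).
    """
    answers = [generate(prompt) for _ in range(n_samples)]
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / n_samples  # winning answer and its vote share
```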
A more advanced inference method is “Best-of-N.” In this method, the SLM generates multiple responses, but instead of relying on majority voting, a reward model is used to evaluate the responses and select the best one. “Weighted Best-of-N,” a more nuanced version of this method, factors in consistency, choosing answers that are both confident and occur more frequently than others.
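A sketch of both variants, assuming hypothetical `generate` and `score` callables, where `score` plays the role of the reward model:

```python
from collections import defaultdict

def best_of_n(prompt, generate, score, n_samples=16, weighted=True):
    """Best-of-N / Weighted Best-of-N sketch.

    `generate` returns one candidate answer from the SLM and `score` is a
    stand-in for the reward model, returning a scalar score for an answer.
    Plain Best-of-N keeps the single highest-scoring sample; the weighted
    variant sums reward scores over identical answers, so an answer that is
    both high-scoring and frequent wins.
    """
    candidates = [generate(prompt) for _ in range(n_samples)]
    scores = [score(prompt, c) for c in candidates]

    if not weighted:
        return candidates[max(range(n_samples), key=lambda i: scores[i])]

    totals = defaultdict(float)
    for cand, s in zip(candidates, scores):
        totals[cand] += s
    return max(totals, key=totals.get)
```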
The researchers used a “process reward model” (PRM), which evaluates the SLM’s response not only on its final answer, but also on the multiple steps it goes through to reach it. Their experiments showed that Weighted Best-of-N and the PRM brought Llama-3.2 1B close to the level of Llama-3.2 8B on the difficult MATH-500 benchmark.
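A toy illustration of how a process reward differs from scoring only the final answer: each intermediate step is scored in context and the step scores are aggregated. The product aggregation and the `score_step` helper below are assumptions for illustration, not the exact scheme used in the study.

```python
def prm_score(prompt, steps, score_step):
    """Score a full reasoning trajectory with a process reward model.

    `score_step(context, step)` is an assumed helper returning a value in
    [0, 1] for one intermediate step given the solution so far. Taking the
    product means a single bad step drags the whole solution down; the
    last-step score or the minimum are other common aggregations.
    """
    total = 1.0
    context = prompt
    for step in steps:
        total *= score_step(context, step)
        context += "\n" + step
    return total
```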
Adding search
To further improve the model’s performance, the researchers added search algorithms to the inference process. Instead of generating the answer in one pass, they used “beam search,” an algorithm that guides the model’s answer generation step by step.
At each step, the SLM generates several partial answers. The search algorithm uses the reward model to evaluate them and keeps a subset worth exploring further. The process repeats until the model exhausts its inference budget or reaches the correct answer. In this way, the inference budget can be concentrated on the most promising answers.
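A rough sketch of step-wise beam search guided by a process reward model; `extend`, `score_path` and `is_complete` are assumed helper callables rather than any library’s API:

```python
def beam_search(prompt, extend, score_path, is_complete,
                beam_width=4, samples_per_beam=4, max_steps=8):
    """Step-wise beam search over reasoning paths (sketch).

    `extend(path)` samples one more reasoning step from the SLM for a partial
    solution, `score_path(path)` is the process reward model's score for that
    partial solution, and `is_complete(path)` says whether it ends in a final
    answer.
    """
    beams = [prompt]
    for _ in range(max_steps):
        candidates = []
        for path in beams:
            if is_complete(path):
                candidates.append(path)  # keep finished solutions as-is
                continue
            for _ in range(samples_per_beam):
                candidates.append(path + "\n" + extend(path))
        # keep only the highest-scoring partial solutions for the next round
        beams = sorted(candidates, key=score_path, reverse=True)[:beam_width]
        if all(is_complete(p) for p in beams):
            break
    return max(beams, key=score_path)
```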
The researchers found that while beam search improves model performance on complex problems, it tends to underperform other techniques on easy problems. To address this, they added two more elements to their inference strategy.
The first was Diverse Verifier Tree Search (DVTS), a variant of beam search that keeps the SLM from getting stuck in false reasoning paths and diversifies the response branches. The second was a “compute-optimal scaling strategy,” as suggested in the DeepMind paper, which dynamically selects the best test-time scaling strategy based on the difficulty of the input problem.
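The compute-optimal strategy can be pictured as a simple router over the methods above. The sketch below is illustrative only: the difficulty thresholds and the routing are assumptions, not the policy used in the DeepMind or Hugging Face work.

```python
def compute_optimal_answer(prompt, difficulty, budget, strategies):
    """Pick a test-time scaling strategy based on estimated problem difficulty.

    `difficulty` in [0, 1] is assumed to come from a cheap estimate, such as
    reward-model scores on a few pilot samples; `strategies` maps names to
    callables taking (prompt, budget). Thresholds are illustrative only.
    """
    if difficulty < 0.3:
        # easy problems: cheap sampling with majority voting is usually enough
        return strategies["majority_vote"](prompt, budget)
    if difficulty < 0.7:
        # medium problems: spend the budget on reward-model-ranked samples
        return strategies["best_of_n"](prompt, budget)
    # hard problems: step-wise search (beam search / DVTS) pays off most
    return strategies["beam_search"](prompt, budget)
```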
The combination of these techniques allowed Llama-3.2 1B to punch above its weight and gain a significant advantage over the 8B model. They also found that the strategy scaled: when applied to Llama-3.2 3B, it was able to achieve better results than the much larger 70B model.
This is not a perfect solution yet
Scaling compute at test time changes the cost dynamics of the model. Enterprises can now decide where to allocate their compute resources. For example, if you are short on memory or can tolerate slower response times, you can use a small model and spend more inference-time cycles to generate more accurate answers.
However, test-time scaling also has its limitations. For example, in the experiments conducted by Hugging Face, the researchers used a specially trained Llama-3.1-8B model as the PRM, which requires running two models in parallel (even if this is much more resource-efficient than the 70B model). The researchers admit that the holy grail of test-time scaling is “self-verification,” where the original model verifies its own responses rather than relying on an external verifier. This remains an open area of research.
The test-time scaling technique presented in this study is also limited to problems whose answers can be clearly evaluated, such as coding and mathematics. Creating reward and verifier models for subjective tasks such as creative writing and product design requires further research.
However, it is clear that test-time scaling has generated a lot of interest and activity, so we can expect more tools and techniques to emerge in the coming months. Enterprises would be smart to keep an eye on how the landscape develops.