In a recent case study, Hugging Face researchers demonstrated how small language models (SLMs) can be configured to outperform much larger models. Their findings show that the Llama 3 model with 3B parameters can outperform the 70B version of the same model family on complex math problems.
Hugging Face has fully documented the entire process and provides a roadmap for businesses trying to create their own tailored reasoning models.
Scaling compute at test time
The work is inspired by OpenAI o1, which uses additional "thinking" to solve complex math, coding and reasoning problems.
The key idea behind models like o1 is to scale "test-time compute," which in practice means using more compute cycles during inference to try, check and confirm different answers and reasoning paths before arriving at a final answer. Scaling compute at test time is especially useful when there is not enough memory to run a large model.
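As a rough back-of-the-envelope illustration of that trade-off (the "about 2 × parameters FLOPs per generated token" approximation and the sample counts below are generic assumptions, not figures from the study):

```python
def inference_flops(params_billion, tokens, samples=1):
    """Approximate forward-pass cost: ~2 * parameters FLOPs per generated token."""
    return 2 * params_billion * 1e9 * tokens * samples

# One pass of a 70B model vs. 64 sampled solutions from a 3B model,
# each solution ~500 tokens long (illustrative numbers only).
big = inference_flops(70, tokens=500)
small = inference_flops(3, tokens=500, samples=64)
print(f"70B single pass: {big:.2e} FLOPs, 3B x 64 samples: {small:.2e} FLOPs")
# The 64-sample 3B run costs roughly 3x the FLOPs of one 70B pass,
# but it needs far less memory at any given moment.
```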
Because o1 is a private model and OpenAI is silent about its inner workings, researchers are speculating about how it really works and attempting to reverse engineer the process. There are already several open alternatives to o1.
Hugging Face’s work builds on a DeepMind study published in August that examined the trade-offs between inference-time and pre-training computation. The study provides comprehensive guidelines for balancing training and inference compute to achieve the best results within a fixed budget.
Beyond spending extra compute at inference time, the success of this method depends on two key elements: a reward model that evaluates the SLM’s responses, and a search algorithm that optimizes the path taken to refine the response.
Various reasoning algorithms
The simplest way to use test-time scaling is “majority voting,” where the same prompt is sent to the model multiple times and the most frequent answer is chosen. Majority voting can be useful on easy problems, but its effectiveness quickly plateaus on complex reasoning problems or tasks where errors are consistent across generations.
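A minimal sketch of majority voting, assuming a hypothetical `generate` callable that queries the SLM once (with non-zero temperature) and returns its final answer as a string:

```python
from collections import Counter

def majority_vote(prompt, generate, n_samples=16):
    """Sample the model several times and return the most frequent final answer.

    `generate` is a stand-in for whatever function calls the SLM and extracts
    its final answer (e.g. the boxed result of a math problem).
    """
    answers = [generate(prompt) for _ in range(n_samples)]
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / n_samples  # winning answer and its vote share
```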
A more advanced inference method is “Best-of-N.” In this method, the SLM generates multiple responses, but instead of relying on majority voting, a reward model is used to evaluate the responses and select the best one. “Weighted Best-of-N,” a more nuanced version of this method, factors in consistency, choosing answers that are both confident and occur more frequently than others.
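A sketch of both variants, assuming hypothetical `generate` and `score` callables, where `score` plays the role of the reward model:

```python
from collections import defaultdict

def best_of_n(prompt, generate, score, n_samples=16, weighted=True):
    """Best-of-N / Weighted Best-of-N sketch.

    `generate` returns one candidate answer from the SLM and `score` is a
    stand-in for the reward model, returning a scalar score for an answer.
    Plain Best-of-N keeps the single highest-scoring sample; the weighted
    variant sums reward scores over identical answers, so an answer that is
    both high-scoring and frequent wins.
    """
    candidates = [generate(prompt) for _ in range(n_samples)]
    scores = [score(prompt, c) for c in candidates]

    if not weighted:
        return candidates[max(range(n_samples), key=lambda i: scores[i])]

    totals = defaultdict(float)
    for cand, s in zip(candidates, scores):
        totals[cand] += s
    return max(totals, key=totals.get)
```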
The researchers used a “process reward model” (PRM), which evaluates the SLM’s response not only on its final answer, but also on the multiple steps it goes through to reach it. Their experiments showed that Weighted Best-of-N and the PRM brought Llama-3.2 1B close to the level of Llama-3.2 8B on the difficult MATH-500 benchmark.
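A toy illustration of how a process reward differs from scoring only the final answer: each intermediate step is scored in context and the step scores are aggregated. The product aggregation and the `score_step` helper below are assumptions for illustration, not the exact scheme used in the study.

```python
def prm_score(prompt, steps, score_step):
    """Score a full reasoning trajectory with a process reward model.

    `score_step(context, step)` is an assumed helper returning a value in
    [0, 1] for one intermediate step given the solution so far. Taking the
    product means a single bad step drags the whole solution down; the
    last-step score or the minimum are other common aggregations.
    """
    total = 1.0
    context = prompt
    for step in steps:
        total *= score_step(context, step)
        context += "\n" + step
    return total
```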
Adding search
To further improve the model’s performance, the researchers added search algorithms to the inference process. Instead of generating the answer in one pass, they used “beam search,” an algorithm that guides the model’s answer generation step by step.
At each step, the SLM generates several partial answers. The search algorithm uses the reward model to evaluate them and keeps a subset worth exploring further. The process repeats until the model exhausts its inference budget or reaches the correct answer. In this way, the inference budget can be concentrated on the most promising answers.
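A rough sketch of step-wise beam search guided by a process reward model; `extend`, `score_path` and `is_complete` are assumed helper callables rather than any library’s API:

```python
def beam_search(prompt, extend, score_path, is_complete,
                beam_width=4, samples_per_beam=4, max_steps=8):
    """Step-wise beam search over reasoning paths (sketch).

    `extend(path)` samples one more reasoning step from the SLM for a partial
    solution, `score_path(path)` is the process reward model's score for that
    partial solution, and `is_complete(path)` says whether it ends in a final
    answer.
    """
    beams = [prompt]
    for _ in range(max_steps):
        candidates = []
        for path in beams:
            if is_complete(path):
                candidates.append(path)  # keep finished solutions as-is
                continue
            for _ in range(samples_per_beam):
                candidates.append(path + "\n" + extend(path))
        # keep only the highest-scoring partial solutions for the next round
        beams = sorted(candidates, key=score_path, reverse=True)[:beam_width]
        if all(is_complete(p) for p in beams):
            break
    return max(beams, key=score_path)
```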
The researchers found that while beam search improves model performance on complex problems, it tends to underperform other techniques on easy problems. To address this, they added two more elements to their inference strategy.
The first was Diverse Verifier Tree Search (DVTS), a variant of beam search that keeps the SLM from getting stuck in false reasoning paths and diversifies the response branches. The second was a “compute-optimal scaling strategy,” as suggested in the DeepMind paper, which dynamically selects the best test-time scaling strategy based on the difficulty of the input problem.
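The compute-optimal strategy can be pictured as a simple router over the methods above. The sketch below is illustrative only: the difficulty thresholds and the routing are assumptions, not the policy used in the DeepMind or Hugging Face work.

```python
def compute_optimal_answer(prompt, difficulty, budget, strategies):
    """Pick a test-time scaling strategy based on estimated problem difficulty.

    `difficulty` in [0, 1] is assumed to come from a cheap estimate, such as
    reward-model scores on a few pilot samples; `strategies` maps names to
    callables taking (prompt, budget). Thresholds are illustrative only.
    """
    if difficulty < 0.3:
        # easy problems: cheap sampling with majority voting is usually enough
        return strategies["majority_vote"](prompt, budget)
    if difficulty < 0.7:
        # medium problems: spend the budget on reward-model-ranked samples
        return strategies["best_of_n"](prompt, budget)
    # hard problems: step-wise search (beam search / DVTS) pays off most
    return strategies["beam_search"](prompt, budget)
```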
The combination of these techniques allowed Llama-3.2 1B to punch above its weight and gain a significant advantage over the 8B model. They also found that the strategy scaled: when applied to Llama-3.2 3B, it was able to achieve better results than the much larger 70B model.
This is not a perfect solution yet
Scaling compute at test time changes the cost dynamics of the model. Enterprises can now decide where to allocate their compute resources. For example, if you are short on memory or can tolerate slower response times, you can use a small model and spend more inference-time cycles to generate more accurate answers.
However, test-time scaling also has its limitations. For example, in the experiments conducted by Hugging Face, the researchers used a specially trained Llama-3.1-8B model as the PRM, which requires running two models in parallel (even if this is much more resource-efficient than the 70B model). The researchers admit that the holy grail of test-time scaling is “self-verification,” where the original model verifies its own responses rather than relying on an external verifier. This remains an open area of research.
The test-time scaling technique presented in this study is also limited to problems whose answers can be clearly evaluated, such as coding and mathematics. Creating reward and verifier models for subjective tasks such as creative writing and product design requires further research.
However, it is clear that test-time scaling has generated a lot of interest and activity, so we can expect more tools and techniques to emerge in the coming months. Enterprises would be smart to keep an eye on how the landscape develops.