DeepMind and UC Berkeley show how to make the most of LLM inference-time compute

Given the high cost and slow pace of training large language models (LLMs), there is ongoing discussion about whether spending more compute cycles on inference can improve LLM performance without the need to retrain the models.

In a new study, researchers from DeepMind and the University of California, Berkeley explore ways to improve LLM performance by strategically allocating compute during inference. Their findings, detailed in a new research paper, suggest that by optimizing the use of inference-time compute, LLMs can achieve significant performance gains without the need for larger models or extensive pre-training.

Trading off inference-time and pre-training compute

The dominant approach to improving LLM performance has been to scale up model size and pre-training compute. However, this approach has limitations. Larger models are expensive to train and require more resources to run, which can make them impractical to deploy in some settings, including on resource-constrained devices.

An alternative is to use more compute during inference to improve the accuracy of LLM responses to challenging prompts. This approach can enable the deployment of smaller LLMs that still achieve performance comparable to larger, more computationally expensive models.

The question is: given a fixed amount of inference-time compute, how can an LLM get the best performance across different inference methods, and how will it compare to a larger pre-trained model?

The most popular approach to scaling test-time compute is best-of-N sampling, in which the model generates N outputs in parallel and the most accurate answer is selected as the final response. However, there are other ways to use inference-time compute to improve LLMs. For example, instead of generating many answers in parallel, you can have the model revise and refine its answer over multiple successive steps. Another approach is to change the verification mechanism that selects the best of the generated answers. You can also combine parallel and sequential sampling with multiple verification strategies and search algorithms to obtain an even richer landscape of inference-time optimization strategies.
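To make the contrast between parallel and sequential scaling concrete, here is a minimal sketch in Python. The generate, revise and score callables are hypothetical stand-ins for an LLM and a verifier; they are not part of the paper's code.

```python
from typing import Callable, List


def best_of_n(prompt: str, n: int,
              generate: Callable[[str], str],
              score: Callable[[str, str], float]) -> str:
    """Parallel scaling: sample N candidate answers and return the one the verifier scores highest."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda ans: score(prompt, ans))


def sequential_revision(prompt: str, steps: int,
                        generate: Callable[[str], str],
                        revise: Callable[[str, str], str],
                        score: Callable[[str, str], float]) -> str:
    """Sequential scaling: draft one answer, then spend the remaining budget revising it."""
    answer = generate(prompt)
    best, best_score = answer, score(prompt, answer)
    for _ in range(steps - 1):
        answer = revise(prompt, answer)  # the model conditions on its previous attempt
        current = score(prompt, answer)
        if current > best_score:
            best, best_score = answer, current
    return best
```

Both functions spend the same kind of budget (model calls); the question the paper asks is which way of spending it pays off more for a given prompt.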

To determine the optimal inference-time strategy, the researchers define a “test-time compute-optimal scaling strategy” as “a strategy that selects hyperparameters appropriate to a given test-time strategy in order to obtain maximum performance benefits on a given prompt at test time.”

“Ideally, test-time compute should modify the distribution so as to generate better outputs than naively sampling from the LLM alone,” the researchers write.

Different ways to use inference-time compute

The researchers explore two main strategies for using inference-time compute to improve LLM performance. The first strategy focuses on modifying the proposal distribution, the process by which the LLM generates answers. This can be done by fine-tuning the LLM to iteratively revise its answers in complex reasoning settings.

The second strategy is to optimize the verifier, the mechanism used to select the best answer from the generated candidates. This can be done by training a process-based reward model that evaluates the correctness of the individual steps of an answer.
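As an illustration of the verifier side, a process-based reward model scores each intermediate step of a candidate solution rather than only the final answer. The sketch below assumes a hypothetical step_score function standing in for such a reward model, and uses min-aggregation over step scores, which is only one of several possible ways to combine them.

```python
from typing import Callable, Sequence


def prm_select(prompt: str,
               candidates: Sequence[Sequence[str]],
               step_score: Callable[[str, Sequence[str]], float]) -> Sequence[str]:
    """Pick the candidate solution whose weakest reasoning step is strongest.

    Each candidate is a list of reasoning steps; step_score(prompt, steps_so_far) is a
    stand-in for a process-based reward model that rates the latest step given the prior ones.
    """
    def aggregate(steps: Sequence[str]) -> float:
        scores = [step_score(prompt, steps[:i + 1]) for i in range(len(steps))]
        return min(scores)  # min-aggregation: penalize any single bad step
    return max(candidates, key=aggregate)
```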

To evaluate their approach, the researchers conducted experiments with both methods on the challenging MATH benchmark using PaLM-2 models.

“For both approaches, we found that the performance of a particular test-time compute strategy depends strongly on both the nature of the specific problem and the underlying LLM used,” the researchers write.

For easier problems, where the underlying LLM can already generate reasonable answers, letting the model iteratively refine its initial answer proved more effective than generating many samples in parallel. For harder problems, which require exploring different solution strategies, resampling many answers in parallel or deploying tree search against a process-based reward model worked better.

Different strategies for verifying answers

“This finding illustrates the need to deploy an adaptive ‘compute-optimal’ strategy for scaling test-time compute, in which the specific approach to using test-time compute is selected on a prompt-by-prompt basis to make the most of the additional computation,” the researchers write.

By properly allocating test-time compute, the researchers were able to significantly improve performance, surpassing the best-of-N baseline while using only about 25% of the compute.
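Put together, the approach can be pictured as a simple dispatcher that estimates how hard a prompt is and routes a fixed budget accordingly. The sketch below is purely illustrative: estimate_difficulty, the thresholds, and the routing rules are hypothetical stand-ins for the paper's compute-optimal selection, and it reuses the best_of_n and sequential_revision helpers sketched earlier.

```python
def compute_optimal_answer(prompt: str, budget: int, model, verifier) -> str:
    """Route a fixed test-time budget based on estimated prompt difficulty (illustrative only)."""
    difficulty = estimate_difficulty(prompt, model)  # hypothetical: e.g. verifier score of a quick draft

    if difficulty < 0.3:
        # Easy prompts: the first draft is usually close, so spend the budget on revisions.
        return sequential_revision(prompt, steps=budget,
                                   generate=model.generate, revise=model.revise,
                                   score=verifier.score)
    if difficulty < 0.7:
        # Medium prompts: a few parallel drafts, each refined with a slice of the budget.
        drafts = [sequential_revision(prompt, steps=budget // 4,
                                      generate=model.generate, revise=model.revise,
                                      score=verifier.score)
                  for _ in range(4)]
        return max(drafts, key=lambda ans: verifier.score(prompt, ans))

    # Hard prompts: explore broadly with parallel sampling (or a reward-model-guided tree search).
    return best_of_n(prompt, n=budget, generate=model.generate, score=verifier.score)
```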

Balancing test-time compute with pre-training compute

The researchers also investigated the extent to which test-time compute can substitute for additional pre-training. They compared a smaller model augmented with extra test-time compute against a model roughly 14 times larger that received more pre-training.
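A rough way to see why such a comparison is even possible (a back-of-the-envelope sketch, not the paper's accounting) uses the common approximations that pre-training costs about 6 FLOPs per parameter per training token and inference about 2 FLOPs per parameter per generated token; all numbers below are hypothetical.

```python
def pretraining_flops(n_params: float, train_tokens: float) -> float:
    """Common approximation: ~6 FLOPs per parameter per training token."""
    return 6 * n_params * train_tokens


def inference_flops(n_params: float, tokens_per_query: float, num_queries: float) -> float:
    """Common approximation: ~2 FLOPs per parameter per generated token."""
    return 2 * n_params * tokens_per_query * num_queries


# Hypothetical comparison: a small model that generates 16x more tokens per query
# versus a 14x larger model, both pre-trained on the same number of tokens.
small_n, big_n = 1e9, 14e9        # parameter counts (illustrative)
train_tokens = 1e12               # pre-training tokens (illustrative)
queries = 1e8                     # expected inference queries (illustrative)

small_total = pretraining_flops(small_n, train_tokens) + inference_flops(small_n, 16 * 1024, queries)
big_total = pretraining_flops(big_n, train_tokens) + inference_flops(big_n, 1024, queries)
print(f"small model + test-time compute: {small_total:.2e} FLOPs vs larger model: {big_total:.2e} FLOPs")
```

Whether the extra test-time tokens actually close the quality gap is what the experiments measure; the arithmetic only shows that the FLOP budgets can be made comparable.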

For easy and medium-difficulty questions, the smaller model with additional test-time compute performed comparably to the larger, pre-trained model.

“This finding suggests that rather than focusing solely on scaling pretraining, in some situations it is more effective to pretrain smaller models with less compute and then apply test-time compute to improve model performance,” the researchers write.

However, for the most difficult questions, additional pre-training compute proved more effective. This indicates that current approaches to scaling test-time compute are not an ideal substitute for scaling pre-training in all scenarios.

The researchers suggest several directions for future research, including exploring more sophisticated strategies that combine different revision and search techniques, and developing more efficient methods for estimating prompt difficulty.

“Generally, [our study] suggests that even with a fairly naive methodology, scaling test-time computation may already be preferable to scaling pretraining, and that further improvements can be achieved as test-time strategies mature,” the researchers write. “In the long term, this points to a future in which fewer FLOPs are spent on pretraining and more FLOPs are spent on inference.”
