Researchers from Google Cloud and the University of California have proposed a new reinforcement learning framework that significantly improves the ability of language models to learn very difficult multi-step reasoning tasks. Supervised reinforcement learning (SRL) reframes problem solving as a sequence of logical “actions,” providing rich learning signals during training.
This approach enables smaller models to learn complex problems that were previously beyond the reach of other common training techniques. Experiments show that SRL not only excels on mathematical reasoning benchmarks but also generalizes effectively to agentic software engineering tasks.
SRL is a versatile training framework that can elevate smaller, cheaper models to stronger reasoning abilities.
The limits of current LLM reasoning training
Recent progress in training large language models (LLMs) for reasoning has largely been driven by reinforcement learning with verifiable rewards (RLVR), a method in which the model is rewarded based on the correctness of its final answer. By repeatedly attempting problems and receiving feedback on the outcome, the model gradually learns effective problem-solving strategies.
However, the success of this outcome-based approach depends on the model’s ability to discover a correct solution within a limited number of attempts, or “rollouts.” Since each rollout is computationally expensive, models cannot keep trying indefinitely. This approach breaks down when problems are so difficult that the model rarely, if ever, finds the right answer within its budget.
This creates a critical learning bottleneck. For many multi-step reasoning problems, a model may solve several steps correctly, yet a single error can derail it and produce a wrong final answer. With RLVR, all of that effort earns a reward of zero, and the model learns nothing from its partially correct work. It is an all-or-nothing approach that offers no granular feedback and provides only sparse rewards.
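The sparse-reward problem can be made concrete with a minimal sketch. This is illustrative, not from the paper: an outcome-only reward collapses a solution that was right for nine of ten steps to the same score as pure noise.

```python
def outcome_reward(final_answer: str, reference: str) -> float:
    """All-or-nothing RLVR-style reward: 1.0 only for a fully correct final answer."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0

# A mostly correct derivation that slips at the last step
# earns exactly the same reward as a random guess:
assert outcome_reward("41", "42") == 0.0   # nine correct steps, one slip -> zero signal
assert outcome_reward("42", "42") == 1.0   # only a perfect answer is rewarded
```

Because the gradient signal is zero for every failed rollout, a model that almost never finds the right answer within its rollout budget gets almost no learning signal at all.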
An alternative is supervised fine-tuning (SFT), in which the model learns from examples containing the full reasoning process laid out by experts. While SFT can instill reasoning abilities, it often leads to overfitting: the model simply learns to imitate the trajectories in the training data rather than generalizing to problems beyond the examples it has seen. The issue is made worse by the fact that high-quality, human-created training data is both scarce and expensive to produce.
As the paper notes, these limitations leave “a critical gap in training small open source models to effectively learn difficult problems.”
How supervised reinforcement learning works
SRL introduces a framework that reframes problem solving as a “sequential decision-making process,” striking a balance between pure outcome-based RL and pure imitation learning. Instead of optimizing only for the final answer or forcing the model to mimic an expert’s entire thought process, SRL trains the model to reproduce the sequence of key actions that form the backbone of the expert’s reasoning. This lets the model learn to act like an expert while developing its own internal reasoning style.
In SRL, expert demonstrations are broken down into a series of intermediate, concrete actions, each representing a meaningful step. For a math problem, an action might be an algebraic manipulation; for a software engineering agent, it might be a command executed in a code repository. To generate training data, SRL uses a powerful teacher model to create solution trajectories, which are then used to train a smaller model.
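A rough sketch of this decomposition, with hypothetical action strings and a helper name that are illustrative rather than taken from the paper: each expert trajectory yields one supervised example per intermediate step, not just one example per problem.

```python
def decompose(trajectory: list[str]) -> list[tuple[list[str], str]]:
    """Turn a sequence of expert actions into (context, next_action) pairs,
    so the model is supervised at every intermediate step, not just the end."""
    return [(trajectory[:i], trajectory[i]) for i in range(len(trajectory))]

# Hypothetical expert solution to a small algebra problem:
steps = [
    "expand (x+1)^2 to x^2 + 2x + 1",
    "subtract x^2 from both sides",
    "solve 2x + 1 = 5 for x",
]
examples = decompose(steps)
# Three training examples: at each one, the model sees the steps so far
# and is trained to produce the expert's next action.
```

This is why SRL's training signal is dense: a three-step solution produces three learning opportunities instead of a single pass/fail outcome.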
According to I-Hung Hsu, a researcher at Google and co-author of the paper, this middle-ground approach is the key to its effectiveness in real-world scenarios. “SRL sits in the middle: it captures the structured flexibility of real-world problem solving, where there are many valid strategies but also clear notions of what ‘good reasoning’ looks like at each step,” Hsu told VentureBeat. “This makes SRL suitable for domains such as data analysis automation or supply chain optimization – tasks that reward sound intermediate reasoning rather than merely final answers.”
During training, the model first generates an “inner monologue” (its internal reasoning process, enclosed in <think> tags) before committing to an action. At each step, SRL provides a reward based on the similarity between the model’s predicted action and the expert’s action. This step-wise reward scheme gives the model dense, fine-grained feedback, allowing it to learn even from partially correct solutions.
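A minimal sketch of such a step-wise reward. The use of `difflib.SequenceMatcher` is a stand-in for whatever similarity metric the paper actually uses; the point is that a near-miss action earns partial credit instead of zero.

```python
import difflib

def step_reward(predicted_action: str, expert_action: str) -> float:
    """Dense per-step reward in [0, 1] from string similarity between the
    model's predicted action and the expert's action at the same step."""
    return difflib.SequenceMatcher(None, predicted_action, expert_action).ratio()

# An exact reproduction of the expert's step gets full reward:
assert step_reward("subtract x^2 from both sides",
                   "subtract x^2 from both sides") == 1.0

# A near-miss still earns substantial partial credit, unlike an
# all-or-nothing outcome reward:
r = step_reward("subtract x^2 from both sides",
                "subtract x^2 from each side")
assert 0.5 < r < 1.0
```

Summed over a trajectory, these per-step rewards give the model a gradient signal even when the final answer is wrong, which is exactly what the outcome-only RLVR setup lacks.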
SRL in action
The researchers’ experiments show that SRL significantly outperforms strong baselines on both challenging math benchmarks and agentic software engineering benchmarks. They also observed that SRL encourages more flexible and sophisticated reasoning patterns, such as interleaved planning and self-verification, which improve solution quality without merely producing longer outputs.
For business leaders, performance gains are only useful if they do not come with runaway costs. Hsu explains that models trained with SRL reason more efficiently. “The gains come from better reasoning quality and structure, not from verbosity,” he said. “In terms of efficiency, SRL-trained models are roughly on par with the base model in token usage… although SRL is not designed to reduce inference cost, it achieves stronger reasoning performance without increasing it.”
For the math benchmarks, the team fine-tuned Qwen2.5-7B-Instruct on a dataset of 1,000 difficult math questions. They compared its performance against models trained with SFT and RLVR (using the GRPO algorithm popularized by models such as DeepSeek-R1) on four competition-level math benchmarks. The SRL-trained model achieved a substantial average performance gain of 3.0% over the other methods.
The team then extended SRL to agentic software engineering, a domain critical for enterprise automation. They trained a coding-specialized model, Qwen2.5-Coder-7B-Instruct, on 5,000 expert trajectories of agents interacting with a coding environment. The SRL-trained model was compared against the original base model and SWE-Gym-7B, a strong baseline fine-tuned with SFT. SRL achieved a task resolution rate of 14.8%, a relative improvement of 74% over the SFT-based model. This demonstrates SRL’s ability to train more capable AI agents for complex, real-world programming tasks.
A new standard for high-stakes AI?
The strongest results in the paper came from combining the methods: first using SRL to teach foundational reasoning, then using RLVR to refine that skill. When the researchers used SRL for pre-training followed by RLVR post-training, they observed an average gain of 3.7%, demonstrating an effective curriculum-style training strategy.
This raises the question of whether such a pipeline could become a new blueprint for building specialized AI.
“We see SRL as a strong foundation,” Hsu said. “In a sense, SRL provides a curriculum – teaching models to think and act step by step – before we refine those behaviors with outcome-based reinforcement learning. This SRL-first approach not only stabilizes the later RL stage, but also makes reasoning more interpretable and generalizable, which is crucial in high-stakes applications.”
Looking ahead, Hsu acknowledges that challenges remain in scaling this pipeline, especially the high cost and complexity of end-to-end RLVR for agentic tasks. However, he is optimistic about the path forward. “While high-quality expert trajectories remain important,” he concluded, “we believe the next big leap will come from automating their generation and filtering – by leveraging strong teacher models or even self-improving student models to bootstrap new data.”
