Researchers at Together AI and Agentica have released DeepCoder-14B, a new coding model that delivers impressive performance comparable to leading proprietary models such as OpenAI's o3-mini.
Built on top of DeepSeek-R1, the model offers greater flexibility for integrating high-performance code generation and reasoning capabilities into real-world applications. Importantly, the teams have fully open-sourced the model, its training data, code, logs, and system optimizations, which can help researchers improve their work and accelerate progress.
Competitive coding performance in a smaller package
The research team's experiments show that DeepCoder-14B performs strongly across several difficult coding benchmarks, including LiveCodeBench (LCB), Codeforces, and HumanEval+.
"Our model demonstrates strong performance across all coding benchmarks … comparable to the performance of o3-mini (low) and o1," the researchers write in a blog post describing the model.
Interestingly, despite being trained primarily on coding tasks, the model shows improved mathematical reasoning, scoring 73.8% on AIME 2024, a 4.1% improvement over its base model (DeepSeek-R1-Distill-Qwen-14B). This suggests that reasoning skills developed through RL on code can generalize effectively to other domains.
The most striking aspect is that it achieves this level of performance with only 14 billion parameters. That makes DeepCoder much smaller and potentially more efficient to run than many frontier models.
Innovations that drive DeepCoder's performance
In developing the model, the researchers tackled some of the key challenges of training coding models with reinforcement learning (RL).
The first challenge was curating training data. Reinforcement learning requires reliable reward signals indicating that the model's output is correct. As the researchers note, "Unlike math, where abundant high-quality, verifiable data is readily available on the internet, the coding domain suffers from a relative scarcity of such data."
To solve this problem, the DeepCoder team implemented a strict pipeline that gathers examples from different datasets and filters them for validity, complexity, and duplication. This process yielded 24,000 high-quality problems, providing a solid foundation for effective RL training.
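As a rough illustration of what such a curation pass could involve, the sketch below keeps only problems that are verifiable (have enough unit tests) and not duplicates; the field names and thresholds are assumptions for illustration, not the team's actual criteria.

```python
def curate_problems(raw_problems, min_tests=5):
    """Keep only problems that are verifiable, non-trivial, and not duplicates.
    Illustrative only: field names and the min_tests threshold are assumed."""
    seen_hashes = set()
    curated = []
    for p in raw_problems:
        tests = p.get("unit_tests", [])
        if len(tests) < min_tests:          # verifiability: enough tests to check a solution
            continue
        h = hash(p["statement"].strip().lower())
        if h in seen_hashes:                # deduplication across source datasets
            continue
        seen_hashes.add(h)
        curated.append(p)
    return curated
```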
The team also designed a straightforward reward function that provides a positive signal only if the generated code passes all sampled unit tests for the problem within a specific time limit. Combined with the high-quality training examples, this outcome-focused reward system prevents the model from learning tricks such as printing memorized answers for public tests or optimizing for simple edge cases without solving the core problem.
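A minimal sketch of what such an all-or-nothing, test-based reward could look like is shown below; the sandboxing approach, helper structure, and timeout value are illustrative assumptions rather than the team's actual implementation.

```python
import os
import subprocess
import tempfile

def sparse_code_reward(generated_code: str, unit_tests: list[str],
                       timeout_s: float = 6.0) -> float:
    """Return 1.0 only if the code passes every sampled unit test within the
    time limit; otherwise 0.0. No partial credit is given, so the model cannot
    collect reward by satisfying only the easy or public test cases."""
    for test in unit_tests:
        # Write the candidate solution plus one test into a temporary script.
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(generated_code + "\n\n" + test)
            path = f.name
        try:
            result = subprocess.run(
                ["python", path], capture_output=True, timeout=timeout_s
            )
            if result.returncode != 0:          # a single failing test -> zero reward
                return 0.0
        except subprocess.TimeoutExpired:       # exceeding the time limit counts as failure
            return 0.0
        finally:
            os.unlink(path)
    return 1.0  # all tests passed within the time budget
```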
The core training algorithm is based on Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm that proved very effective in DeepSeek-R1. However, the team made several modifications to the algorithm to make it more stable and to let the model keep improving as training runs for longer.
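At its core, GRPO drops the learned value function used by PPO and instead samples a group of responses per prompt, scoring each one against the group average. A minimal sketch of that group-relative advantage computation (simplified, and omitting the team's stability modifications) might look like this:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: shape (group_size,), one scalar reward per sampled response to
    the same prompt. Each response's advantage is its reward standardized by
    the group mean and standard deviation, so no value network is needed."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 4 responses to one coding prompt, only the second passes all tests.
rewards = torch.tensor([0.0, 1.0, 0.0, 0.0])
print(grpo_advantages(rewards))  # positive for the passing response, negative otherwise
```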

Finally, the team expanded the model's context window iteratively, first training it on shorter reasoning sequences and gradually increasing the length. They also developed a filtering method to avoid penalizing the model when it produced reasoning chains that exceeded the context limit while solving a hard prompt.

The researchers explain the core idea: "To preserve long-context reasoning while enabling efficient training, we incorporated overlong filtering … This technique masks out truncated sequences during training so that models aren't penalized for generating thoughtful but lengthy outputs that exceed the current context limit."
Training was gradually scaled from a 16K to a 32K context window, and the resulting model can also solve problems that require up to 64K tokens.
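A minimal sketch of the masking idea behind that overlong filtering, assuming a per-token policy-gradient loss in which truncated rollouts are simply zeroed out of the objective, could look like this:

```python
import torch

def masked_policy_loss(per_token_loss: torch.Tensor,
                       was_truncated: torch.Tensor) -> torch.Tensor:
    """per_token_loss: (batch, seq_len) per-token policy-gradient loss.
    was_truncated: (batch,) bool, True if the response hit the context limit
    before finishing. Truncated sequences are masked out of the objective so
    the model is not penalized for long but unfinished reasoning chains."""
    keep = (~was_truncated).float().unsqueeze(1)              # (batch, 1), broadcast over tokens
    kept_tokens = keep.expand_as(per_token_loss).sum().clamp(min=1.0)
    return (per_token_loss * keep).sum() / kept_tokens        # average over non-masked tokens only
```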
Optimizing long-context RL training
Training large models with RL, especially on tasks that require long generated sequences such as coding or complex reasoning, is computationally intensive and slow. A major bottleneck is the sampling step, in which the model generates potentially thousands of tokens per example in the batch. Variation in response length means some responses finish much later than others, leaving GPUs idle and slowing down the entire training loop.
To speed this up, the team developed verl-pipeline, an optimized extension of the open-source verl library for reinforcement learning from human feedback (RLHF). The key innovation, which they call "One-Off Pipelining," rearranges response sampling and model updates to reduce bottlenecks and accelerator idle time.

Their experiments showed that One-Off Pipelining provided up to a 2x speedup for coding RL tasks compared to the baseline implementation. This optimization was crucial for training DeepCoder within a reasonable time frame (2.5 weeks on 32 H100s) and is now open-sourced as part of verl-pipeline for the community to use and build upon.
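Conceptually, the idea is to keep sampling for the next batch running while the current batch is used for the policy update, so accelerators are not left waiting on stragglers. A heavily simplified sketch (using Python threads purely for illustration; the actual verl-pipeline implementation is far more involved) might look like this:

```python
from concurrent.futures import ThreadPoolExecutor

def one_off_pipelined_training(sample_batch, train_on, num_steps):
    """sample_batch(): generates rollouts for one batch (slow, variable length).
    train_on(batch): runs the policy update on an already-sampled batch.
    The sampler stays one batch ahead of the trainer, so the update at step t
    uses rollouts gathered with the slightly stale policy from step t-1."""
    with ThreadPoolExecutor(max_workers=1) as sampler:
        future = sampler.submit(sample_batch)       # prefetch the first batch of rollouts
        for _ in range(num_steps):
            batch = future.result()                 # wait for the in-flight rollouts
            future = sampler.submit(sample_batch)   # immediately start sampling the next batch
            train_on(batch)                         # train while the next batch is being sampled
```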
Impact for the enterprise
The researchers have made all the artifacts for training and running DeepCoder-14B available on GitHub and Hugging Face under a permissive license.
"By fully sharing our dataset, code, and training recipe, we empower the community to reproduce our work and make RL training accessible to all," the researchers write.
DeepCoder-14B vividly illustrates a broader, accelerating trend in the AI landscape: the rise of highly capable yet efficient and openly accessible models.
For the enterprise world, this shift means more options and greater accessibility for advanced models. Cutting-edge performance is no longer solely the domain of hyperscalers or those willing to pay premium API fees. Models like DeepCoder can empower organizations of all sizes to use sophisticated code generation and reasoning, customize solutions to their specific needs, and securely deploy them in their own environments.
This trend can lower the barrier to entry for AI adoption and foster a more competitive and innovative ecosystem in which progress is driven by open-source collaboration.
