DeepSeek-V3, an ultra-large open-source AI model, outperforms Llama and Qwen at launch

Chinese AI startup DeepSeek, known for challenging leading AI vendors with innovative open-source technologies, today launched a new ultra-large model: DeepSeek-V3.

Available via Hugging Face under the company's licensing agreement, the new model comes with 671B parameters but uses a mixture-of-experts architecture that activates only selected parameters, so it can handle given tasks accurately and efficiently. According to benchmarks shared by DeepSeek, the offering is already topping the charts, outperforming leading open-source models including Meta's Llama 3.1-405B and closely matching the performance of closed-source models from Anthropic and OpenAI.


This release marks another major step toward closing the gap between closed-source and open-source AI. Ultimately, DeepSeek, which began as an offshoot of the Chinese quantitative hedge fund High-Flyer Capital Management, hopes these developments will pave the way for artificial general intelligence (AGI), in which models would be able to understand and learn any intellectual task a human can perform.

What does DeepSeek-V3 offer?

Like its predecessor DeepSeek-V2, the new ultra-large model uses the same basic architecture, built around multi-head latent attention (MLA) and DeepSeekMoE. This approach ensures efficient training and inference, with specialized and shared "experts" (individual, smaller neural networks within the larger model) activating 37B of the 671B parameters for each token.
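
To make the sparse-activation idea concrete, here is a minimal, hypothetical sketch of top-k expert routing in PyTorch. It is not DeepSeek's implementation; the layer sizes, expert count and router are illustrative, and it omits MLA, shared experts and DeepSeek's specific routing scheme.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    """Minimal mixture-of-experts layer: a router picks top-k experts per token,
    so only a fraction of the layer's parameters are active for any one token."""
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                                 # x: (tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)
        weights, idx = scores.topk(self.top_k, dim=-1)    # chosen experts per token
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                  # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

tokens = torch.randn(5, 64)
print(TinyMoELayer()(tokens).shape)   # torch.Size([5, 64])
```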

While the basic architecture provides solid performance for DeepSeek-V3, the company has also introduced two innovations that raise the bar even further.

The first is an auxiliary-loss-free load-balancing strategy. It dynamically monitors and adjusts the load on each expert to keep utilization balanced without compromising overall model performance. The second is multi-token prediction (MTP), which lets the model predict multiple future tokens at once. This innovation not only improves training efficiency but also lets the model generate text three times faster, at 60 tokens per second.
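
The bias-adjustment idea behind auxiliary-loss-free balancing can be sketched roughly as follows: instead of adding a balancing loss term, a per-expert bias added to the router scores is nudged up for underloaded experts and down for overloaded ones. The function name, step size and load values here are illustrative, not DeepSeek's actual procedure, and MTP is not covered by this snippet.

```python
import torch

def update_routing_bias(expert_load, bias, target_load, step=0.001):
    """Illustrative auxiliary-loss-free balancing: nudge a per-expert bias that is
    added to router scores, pushing traffic away from overloaded experts and
    toward underloaded ones. `expert_load` is the fraction of tokens each expert
    received in the last batch."""
    overloaded = expert_load > target_load
    return bias - step * overloaded.float() + step * (~overloaded).float()

n_experts = 8
bias = torch.zeros(n_experts)
load = torch.tensor([0.30, 0.20, 0.15, 0.10, 0.10, 0.05, 0.05, 0.05])
bias = update_routing_bias(load, bias, target_load=1.0 / n_experts)
print(bias)   # negative for busy experts, positive for idle ones
```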

“During pre-training, we trained DeepSeek-V3 on 14.8T high-quality and diverse tokens… We then performed a two-stage context length extension for DeepSeek-V3,” the company wrote in a technical paper describing the new model in detail. “In the first stage, the maximum context length is extended to 32K, and in the second stage it is further extended to 128K. Following this, we conducted post-training, including supervised fine-tuning (SFT) and reinforcement learning (RL), on the base DeepSeek-V3 model to align it with human preferences and further unlock its potential. During the post-training stage, we distill the reasoning capability from the DeepSeek-R1 series of models while carefully maintaining the balance between model accuracy and generation length.”

It is worth noting that during training, DeepSeek applied many hardware and algorithmic optimizations, including an FP8 mixed-precision training framework and the DualPipe algorithm for pipeline parallelism, to cut training costs.
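
DeepSeek's FP8 recipe involves custom kernels and scaling details beyond the scope of this article, but the general mixed-precision pattern it builds on can be sketched with bfloat16 as a stand-in: keep master weights in full precision, run compute in low precision, and apply updates back to the full-precision copy. This is a simplified, assumed analogue, not the method from the paper.

```python
import torch

# Simplified mixed-precision pattern (bf16 here; DeepSeek uses FP8 with custom kernels):
# keep master weights in full precision, run forward/backward in low precision,
# then apply the gradient update back to the full-precision copy.
master_w = torch.randn(512, 512)                          # fp32 master weights
x = torch.randn(64, 512)

for step in range(3):
    w_lp = master_w.to(torch.bfloat16).requires_grad_()   # low-precision working copy
    loss = (x.to(torch.bfloat16) @ w_lp).pow(2).mean()
    loss.backward()                                        # gradients in low precision
    master_w -= 1e-3 * w_lp.grad.float()                   # update the fp32 master copy
    print(f"step {step}: loss={loss.item():.4f}")
```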

Overall, DeepSeek claims to have completed the entire DeepSeek-V3 training run in roughly 2.788 million H800 GPU hours, or roughly $5.57 million, assuming a rental price of $2 per GPU hour. This is significantly lower than the hundreds of millions of dollars typically spent pre-training large language models.
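
As a quick sanity check on that figure, the arithmetic works out as follows (GPU-hour count and hourly rate taken from the claims above):

```python
gpu_hours = 2_788_000        # reported H800 GPU hours for the full training run
price_per_hour = 2.0         # assumed rental price in USD per GPU hour
print(f"${gpu_hours * price_per_hour / 1e6:.3f}M")   # prints $5.576M
```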

For example, Llama-3.1 is estimated to have been trained with an investment of over $500 million.

The strongest open source model currently available

Despite its economical training, DeepSeek-V3 has become the strongest open-source model on the market.

The company ran multiple benchmarks to compare AI performance and noted that DeepSeek-V3 convincingly outperforms leading open models, including Llama-3.1-405B and Qwen 2.5-72B. It even outperforms the closed-source GPT-4o on most tests, with the exception of the English-focused SimpleQA and FRAMES, where the OpenAI model led with scores of 38.2 and 80.5 (vs. 24.9 and 73.3), respectively.

Notably, DeepSeek-V3’s performance particularly stood out on the Chinese and math benchmarks, where it scored higher than all of its peers. It scored 90.2 on the Math-500 test, with Qwen’s score of 80 the next best.

The only model that managed to challenge DeepSeek-V3 was Anthropic’s Claude 3.5 Sonnet, which outperformed it on the MMLU-Pro, IF-Eval, GPQA-Diamond, SWE Verified and Aider-Edit tests.

The work shows open-source models closing in on closed-source ones, promising near-equivalent performance across a range of tasks. Such development is extremely good for the industry because it potentially removes the chance of a single large AI player ruling the game. It also gives enterprises multiple options to choose from and work with when orchestrating their stacks.

Currently, the DeepSeek-V3 code is available via GitHub under an MIT license, while the model is released under the company’s model license. Enterprises can also test the new model via DeepSeek Chat, a ChatGPT-like platform, and access the API for commercial use. DeepSeek is offering the API at the same price as DeepSeek-V2 until February 8. After that, it will charge $0.27 per million input tokens ($0.07 per million tokens for cache hits) and $1.10 per million output tokens.
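
For a rough sense of what that pricing means in practice, here is an illustrative cost calculation using hypothetical token volumes:

```python
# Illustrative cost estimate using the listed post-February-8 API prices.
# Token volumes here are hypothetical.
input_tokens, cached_tokens, output_tokens = 10_000_000, 4_000_000, 2_000_000

cost = ((input_tokens - cached_tokens) / 1e6 * 0.27   # cache-miss input tokens
        + cached_tokens / 1e6 * 0.07                  # cache-hit input tokens
        + output_tokens / 1e6 * 1.10)                 # output tokens
print(f"${cost:.2f}")   # $4.10
```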
