How DeepSeek’s innovations in LLM training and architecture changed the course of AI.

DeepSeek, a Chinese AI firm, has recently made waves in the AI industry by developing powerful large language models (LLMs) at a fraction of the cost incurred by its competitors, primarily through innovative training techniques. This is particularly notable given the restrictions on exporting high-performance AI chips to China. Here’s a breakdown of how DeepSeek achieved this:

Reinforcement Learning (RL) Focus: DeepSeek has invested heavily in large-scale reinforcement learning focused on reasoning tasks. Rather than relying on supervised fine-tuning (SFT) as a preliminary step, DeepSeek applies RL directly to the base model, allowing it to explore chain-of-thought (CoT) reasoning for complex problems and to develop capabilities such as self-verification and reflection. In effect, the model evolves its reasoning capabilities autonomously.
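
As an illustration of what RL over reasoning outputs can look like, the sketch below samples several chain-of-thought responses to the same prompt, scores each one, and normalises the scores within the group so the policy is nudged toward the better completions. The group-relative scheme, function names, and toy reward here are assumptions for illustration; the source does not describe DeepSeek’s exact algorithm.

```python
from typing import Callable, List
import statistics

def group_relative_advantages(
    responses: List[str],
    reward_fn: Callable[[str], float],
) -> List[float]:
    """Score a group of sampled responses to one prompt and normalise the
    rewards within the group; completions that beat their siblings get a
    positive advantage and would be reinforced by the RL update."""
    rewards = [reward_fn(r) for r in responses]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero spread
    return [(r - mean) / std for r in rewards]

# Toy usage: two sampled chain-of-thought completions for "What is 2 + 2?"
if __name__ == "__main__":
    group = [
        "<think>2 + 2 = 4</think><answer>4</answer>",
        "<think>a guess</think><answer>5</answer>",
    ]
    reward = lambda r: 1.0 if "<answer>4</answer>" in r else 0.0
    print(group_relative_advantages(group, reward))  # [1.0, -1.0]
```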

Rule-Based Reward System: Instead of using neural reward models, DeepSeek developed a rule-based reward system. This system evaluates the correctness of responses and enforces a specific output format with thinking processes between designated tags. This approach simplifies the training process and avoids the potential for reward hacking often encountered with neural reward models.
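
A rule-based reward of this kind is straightforward to express in code. The sketch below checks that a response places its reasoning between designated tags and that the final answer matches a known ground truth; the <think>/<answer> tag names and the scoring weights are illustrative assumptions, not DeepSeek’s published values.

```python
import re

# Expected shape: reasoning inside <think>...</think>, then the final answer.
# Tag names and weights are illustrative assumptions.
RESPONSE_PATTERN = re.compile(
    r"^<think>.+?</think>\s*<answer>(.+?)</answer>\s*$", re.DOTALL
)

def rule_based_reward(response: str, ground_truth: str) -> float:
    """Return a scalar reward from fixed rules: a small bonus for following
    the required format plus a larger bonus for a correct final answer.
    No neural reward model is trained, so there is nothing for the policy
    to exploit through reward hacking."""
    match = RESPONSE_PATTERN.match(response.strip())
    if match is None:
        return 0.0                       # malformed output earns nothing
    format_bonus = 0.1
    answer = match.group(1).strip()
    accuracy_bonus = 1.0 if answer == ground_truth.strip() else 0.0
    return format_bonus + accuracy_bonus
```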

Cold Start Data: To further enhance the RL process, DeepSeek incorporates a small amount of high-quality, human-friendly “cold-start” data. This data is used to fine-tune the model before RL, providing a more stable starting point and making the output more readable. The data includes a summary at the end of each response and is filtered to ensure it is user-friendly. This process contrasts with DeepSeek-R1-Zero, which does not use any SFT data.
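
To make this concrete, the sketch below shows one way such cold-start examples might be formatted and filtered before supervised fine-tuning: each record keeps a readable reasoning trace followed by a short summary, and traces that fail a simple readability check are dropped. The field names, tag format, and filtering heuristic are assumptions for illustration, not DeepSeek’s actual pipeline.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ColdStartExample:
    prompt: str
    reasoning: str   # human-friendly chain of thought
    summary: str     # short recap appended after the reasoning

def format_for_sft(example: ColdStartExample) -> Optional[str]:
    """Render one example as a training string, or drop it if it is too
    terse or lacks the closing summary (a stand-in readability filter)."""
    if not example.summary or len(example.reasoning.split()) < 10:
        return None
    return (
        f"{example.prompt}\n"
        f"<think>{example.reasoning}</think>\n"
        f"{example.summary}"
    )

def build_cold_start_corpus(examples: List[ColdStartExample]) -> List[str]:
    """Keep only the examples that pass the filter; the result is used to
    fine-tune the base model before RL begins."""
    return [text for text in map(format_for_sft, examples) if text is not None]
```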

Distillation: DeepSeek uses efficient knowledge-transfer techniques to distill the reasoning capabilities of large models into smaller ones, creating compact models with strong performance. For example, DeepSeek-R1-Distill-Qwen-7B scores 55.5% on AIME 2024, outperforming QwQ-32B-Preview. The company open-sources these distilled models, including 1.5B, 7B, 8B, 14B, 32B, and 70B checkpoints. Distillation proves more efficient than running large-scale RL directly on the smaller models.
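
The sketch below illustrates the general shape of this kind of distillation: a large teacher model writes full reasoning traces, the traces whose final answers verify correctly are kept, and the resulting pairs become ordinary supervised fine-tuning data for a smaller student. The callables generate, is_correct, and student_fine_tune are placeholders for real model and training APIs; this is an illustrative sketch, not DeepSeek’s actual distillation code.

```python
from typing import Callable, List, Tuple

def build_distillation_set(
    prompts: List[str],
    answers: List[str],
    generate: Callable[[str], str],          # teacher: prompt -> reasoning trace
    is_correct: Callable[[str, str], bool],  # verifier for the final answer
) -> List[Tuple[str, str]]:
    """Collect (prompt, teacher_trace) pairs whose final answer checks out,
    so the student only imitates traces that actually solve the problem."""
    dataset = []
    for prompt, answer in zip(prompts, answers):
        trace = generate(prompt)
        if is_correct(trace, answer):
            dataset.append((prompt, trace))
    return dataset

def distill(
    dataset: List[Tuple[str, str]],
    student_fine_tune: Callable[[List[Tuple[str, str]]], None],
) -> None:
    """Plain supervised fine-tuning of the small student on the filtered
    traces; no large-scale RL is run on the student itself."""
    student_fine_tune(dataset)
```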

Emergent Behaviour: DeepSeek discovered that complex reasoning patterns can emerge naturally through reinforcement learning without explicit programming, referred to as “emergent behaviour”. This means the model develops sophisticated reasoning strategies on its own, such as revisiting and re-evaluating previous steps and exploring alternative approaches. An “aha moment” was observed during the training of DeepSeek-R1-Zero, when the model learned to allocate more thinking time by re-evaluating its initial approach.

Lower Cost and Faster Retraining: The company claims to have developed its R1 model for less than $6 million, a stark contrast to the hundreds of millions of dollars spent by competitors. Its approach is also more efficient, using less training time and fewer AI accelerators. Because the rule-based reward system requires no separate reward model to train, it further reduces both compute and training time.

In summary, DeepSeek has leveraged a combination of innovative approaches to optimise chip efficiency and achieve faster retraining at lower cost. Its focus on reinforcement learning, rule-based reward systems, cold-start data, distillation techniques, and the harnessing of emergent behaviour has allowed it to create high-performing LLMs while overcoming resource limitations, and these innovations have positioned DeepSeek to challenge U.S. tech giants in the AI space.