-
DeepSeek used GRPO (Group Relative Policy Optimization), a variant of PPO (Proximal Policy Optimization), for training
-
GRPO differs from PPO in several ways:
- Samples multiple completions per prompt (G of them) instead of just one
- Doesn't use a separate value network; the group of completions serves as a Monte Carlo estimate of the baseline (see the sketch after this list)
- Uses a reference policy (the SFT model) with a KL-divergence penalty to prevent drift
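A minimal Python sketch of the group-relative advantage at the core of GRPO; the function name and example rewards are illustrative, not DeepSeek's code:

```python
# Group-relative advantage: score each of the G completions of one prompt
# against the statistics of its own group, so no learned value network is needed.
# The full GRPO objective also adds a per-token KL penalty against the frozen
# SFT reference policy (not shown here).
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each completion = (reward - group mean) / group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: G = 4 completions of the same prompt, scored by a rule-based reward
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
# -> roughly [1.0, -1.0, -1.0, 1.0]
```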
-
The model is a distilled version built on Qwen 7B (7.62 billion parameters)
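A hedged example of loading and prompting the distilled 7B model with Hugging Face transformers; the repo id and generation settings are assumptions, and it requires transformers plus accelerate:

```python
# Illustrative inference with the distilled model; not DeepSeek's own serving code.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # assumed Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = "Solve: what is 17 * 23? Think step by step."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```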
-
They used a custom 8-bit floating-point format (FP8, E4M3) for lower-precision training
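For a feel of the number format only (not DeepSeek's FP8 training kernels), recent PyTorch builds expose an E4M3 dtype that can be used to inspect the quantization error:

```python
# Round-trip a tensor through the 8-bit E4M3 format (1 sign, 4 exponent, 3 mantissa bits).
# Requires a PyTorch version that provides torch.float8_e4m3fn.
import torch

x = torch.randn(4, dtype=torch.float32)
x_fp8 = x.to(torch.float8_e4m3fn)    # cast down to 8-bit floats
x_back = x_fp8.to(torch.float32)     # cast back up to inspect rounding error
print(x)
print(x_back)
print((x - x_back).abs().max())      # worst-case absolute error from the cast
```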
-
Employed mixture-of-experts (MoE) techniques
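A toy top-k routed MoE layer, just to illustrate the routing idea; the sizes, router, and top-k choice are assumptions, not DeepSeek's architecture:

```python
# Each token is routed to its top-k experts; only those small MLPs run for that token,
# which is how MoE keeps per-token compute low while total parameter count grows.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts)  # router producing expert scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)
        weights, idx = scores.topk(self.top_k, dim=-1)   # top-k experts per token
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                    # tokens sent to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * expert(x[mask])
        return out

print(TinyMoE()(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```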
-
Used H800 GPUs, which have the same compute as H100s but lower interconnect bandwidth
-
Wrote custom PTX (low-level GPU assembly-like) code to optimize GPU usage:
- Dedicated 20 of the 132 streaming multiprocessors (SMs) per GPU to managing cross-chip communication (see the overlap sketch after this list)
- This reportedly allowed them to achieve H100-like performance on H800s
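The PTX itself isn't reproducible here, but the idea it serves, keeping cross-GPU communication running concurrently with computation, can be sketched with PyTorch's asynchronous collectives (assumes torch.distributed is already initialized with an NCCL backend):

```python
# Analogy only, not DeepSeek's PTX: overlap an all-reduce with useful compute
# so the reduced interconnect bandwidth of the H800 is hidden behind math.
import torch
import torch.distributed as dist

def overlapped_step(grad_chunk, next_inputs, model):
    # Launch the all-reduce without blocking; NCCL moves data in the background
    handle = dist.all_reduce(grad_chunk, op=dist.ReduceOp.SUM, async_op=True)
    # Keep the SMs busy with compute while the communication is in flight
    activations = model(next_inputs)
    # Block only at the point where the reduced gradients are actually needed
    handle.wait()
    return activations, grad_chunk
```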
-
Achieved significant cost savings (claimed 40x reduction) through various optimizations:
- Removing the value network
- Custom low-precision formats
- GPU optimizations
- Mixture of experts techniques
-
Training approach and results:
- Used math datasets such as AIMO NuminaMath-TIR for training
- Employed reinforcement learning techniques to improve reasoning capabilities (see the reward sketch after this list)
- Focused on allowing longer reasoning traces before producing final answers
- Achieved strong performance on math and reasoning tasks
- Demonstrated ability to perform multi-step reasoning
- Showed improvements over supervised fine-tuning and conventional RLHF approaches
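A sketch of the kind of rule-based reward such RL training can use to score math completions; the regexes, reward values, and tag format are assumptions, not DeepSeek's published reward code:

```python
# Reward = small bonus for keeping reasoning inside <think>...</think>,
# plus a full point when the boxed final answer matches the reference.
import re

def math_reward(completion: str, gold_answer: str) -> float:
    reward = 0.0
    if re.search(r"<think>.*?</think>", completion, flags=re.DOTALL):
        reward += 0.1                                    # format reward (assumed value)
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match and match.group(1).strip() == gold_answer.strip():
        reward += 1.0                                    # accuracy reward
    return reward

print(math_reward("<think>17*23=391</think> The answer is \\boxed{391}.", "391"))  # 1.1
```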
DeepSeek's success came from a combination of algorithmic innovations, hardware optimizations, and efficient training techniques, rather than any single breakthrough.