-
DeepSeek used GRPO (Group Relative Policy Optimization), a variant of PPO (Proximal Policy Optimization), for training
-
GRPO differs from PPO in several ways:
- Samples multiple completions per prompt (G of them) instead of just one
- Doesn't use a separate value network; the group of completions serves as a Monte Carlo estimate of the baseline (see the sketch after this list)
- Uses a reference policy (the SFT model) with a KL-divergence penalty to prevent drift
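A minimal Python sketch of the group-relative advantage at the core of GRPO; the function name and example rewards are illustrative, not DeepSeek's code:

```python
# Group-relative advantage: score each of the G completions of one prompt
# against the statistics of its own group, so no learned value network is needed.
# The full GRPO objective also adds a per-token KL penalty against the frozen
# SFT reference policy (not shown here).
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Advantage of each completion = (reward - group mean) / group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: G = 4 completions of the same prompt, scored by a rule-based reward
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
# -> roughly [1.0, -1.0, -1.0, 1.0]
```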
-
The model is a distilled version built on Qwen 7B (7.62 billion parameters)
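A hedged example of loading and prompting the distilled 7B model with Hugging Face transformers; the repo id and generation settings are assumptions, and it requires transformers plus accelerate:

```python
# Illustrative inference with the distilled model; not DeepSeek's own serving code.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"  # assumed Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

prompt = "Solve: what is 17 * 23? Think step by step."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```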
-
They used a custom 8-bit floating-point format (FP8, E4M3) for lower-precision training
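For a feel of the number format only (not DeepSeek's FP8 training kernels), recent PyTorch builds expose an E4M3 dtype that can be used to inspect the quantization error:

```python
# Round-trip a tensor through the 8-bit E4M3 format (1 sign, 4 exponent, 3 mantissa bits).
# Requires a PyTorch version that provides torch.float8_e4m3fn.
import torch

x = torch.randn(4, dtype=torch.float32)
x_fp8 = x.to(torch.float8_e4m3fn)    # cast down to 8-bit floats
x_back = x_fp8.to(torch.float32)     # cast back up to inspect rounding error
print(x)
print(x_back)
print((x - x_back).abs().max())      # worst-case absolute error from the cast
```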
-
Employed mixture-of-experts (MoE) techniques
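A toy top-k routed MoE layer, just to illustrate the routing idea; the sizes, router, and top-k choice are assumptions, not DeepSeek's architecture:

```python
# Each token is routed to its top-k experts; only those small MLPs run for that token,
# which is how MoE keeps per-token compute low while total parameter count grows.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts)  # router producing expert scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)
        weights, idx = scores.topk(self.top_k, dim=-1)   # top-k experts per token
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                    # tokens sent to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * expert(x[mask])
        return out

print(TinyMoE()(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```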
-
Used H800 GPUs, which have the same compute as H100s but lower interconnect bandwidth
-
Wrote custom PTX (low-level GPU assembly-like) code to optimize GPU usage:
- Dedicated 20 of the 132 streaming multiprocessors (SMs) per GPU to managing cross-chip communication (see the overlap sketch after this list)
- This reportedly allowed them to achieve H100-like performance on H800s
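The PTX itself isn't reproducible here, but the idea it serves, keeping cross-GPU communication running concurrently with computation, can be sketched with PyTorch's asynchronous collectives (assumes torch.distributed is already initialized with an NCCL backend):

```python
# Analogy only, not DeepSeek's PTX: overlap an all-reduce with useful compute
# so the reduced interconnect bandwidth of the H800 is hidden behind math.
import torch
import torch.distributed as dist

def overlapped_step(grad_chunk, next_inputs, model):
    # Launch the all-reduce without blocking; NCCL moves data in the background
    handle = dist.all_reduce(grad_chunk, op=dist.ReduceOp.SUM, async_op=True)
    # Keep the SMs busy with compute while the communication is in flight
    activations = model(next_inputs)
    # Block only at the point where the reduced gradients are actually needed
    handle.wait()
    return activations, grad_chunk
```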
-
Achieved significant cost savings (claimed 40x reduction) through various optimizations:
- Removing the value network
- Custom low-precision formats
- GPU optimizations
- Mixture of experts techniques
-
Training approach and results:
- Used math datasets such as AIMO NuminaMath-TIR for training
- Employed reinforcement learning techniques to improve reasoning capabilities (see the reward sketch after this list)
- Focused on allowing longer reasoning traces before producing final answers
- Achieved strong performance on math and reasoning tasks
- Demonstrated ability to perform multi-step reasoning
- Showed improvements over supervised fine-tuning and conventional RLHF approaches
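A sketch of the kind of rule-based reward such RL training can use to score math completions; the regexes, reward values, and tag format are assumptions, not DeepSeek's published reward code:

```python
# Reward = small bonus for keeping reasoning inside <think>...</think>,
# plus a full point when the boxed final answer matches the reference.
import re

def math_reward(completion: str, gold_answer: str) -> float:
    reward = 0.0
    if re.search(r"<think>.*?</think>", completion, flags=re.DOTALL):
        reward += 0.1                                    # format reward (assumed value)
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match and match.group(1).strip() == gold_answer.strip():
        reward += 1.0                                    # accuracy reward
    return reward

print(math_reward("<think>17*23=391</think> The answer is \\boxed{391}.", "391"))  # 1.1
```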
DeepSeek's success came from a combination of algorithmic innovations, hardware optimizations, and efficient training techniques, rather than any single breakthrough.