DeepSeek Model Architecture and Training

Model Architecture and Training

  • DeepSeek used GRPO (Group Relative Policy Optimization), a variant of PPO (Proximal Policy Optimization), for training

  • GRPO differs from PPO in several ways (see the sketch after this list):

    • Samples a group of G completions per prompt instead of just one
    • Drops the separate value network, using the group's rewards as a Monte Carlo baseline for the advantage
    • Penalizes KL divergence from a reference policy (the SFT model) to prevent the policy from drifting
  • The model discussed here is a distilled variant built on Qwen 7B (7.62 billion parameters)

  • They used a custom 8-bit floating-point format (FP8 E4M3) for lower-precision training (see the round-trip sketch after this list)

  • Employed mixture-of-experts (MoE) techniques
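
A minimal sketch of the GRPO pieces listed above, assuming PyTorch. The group-normalized advantage and the per-token KL estimator follow the published GRPO formulation, while the toy rewards and log-probabilities below are purely illustrative:

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # Normalize each completion's reward against its group's mean and std.
    # The group itself serves as the Monte Carlo baseline, so no value
    # network is needed.
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)

def kl_penalty(logp: torch.Tensor, ref_logp: torch.Tensor) -> torch.Tensor:
    # Unbiased per-token estimator of KL(policy || reference), used to keep
    # the RL policy close to the SFT reference model.
    ratio = torch.exp(ref_logp - logp)
    return ratio - (ref_logp - logp) - 1.0

# One prompt, G = 4 sampled completions scored by a (hypothetical) reward function.
rewards = torch.tensor([[1.0, 0.0, 0.5, 1.0]])
print(grpo_advantages(rewards))    # group-normalized advantages, one per completion

# Toy per-token log-probs under the current policy and the SFT reference.
logp = torch.tensor([-1.0, -2.0])
ref_logp = torch.tensor([-1.2, -1.8])
print(kl_penalty(logp, ref_logp))  # small, nonnegative KL estimates
```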
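
To make the low-precision point concrete, here is a sketch assuming PyTorch 2.1+ (which exposes the torch.float8_e4m3fn dtype) showing the round-trip error E4M3 introduces; DeepSeek's actual FP8 training kernels are custom and not shown here:

```python
import torch

# Round-trip a tensor through FP8 E4M3 to see the precision loss that
# low-precision training has to absorb (requires PyTorch >= 2.1).
x = torch.randn(4, 4, dtype=torch.float32)
x_fp8 = x.to(torch.float8_e4m3fn)   # quantize: 1 sign, 4 exponent, 3 mantissa bits
x_back = x_fp8.to(torch.float32)    # dequantize for comparison
print((x - x_back).abs().max())     # worst-case round-trip error
```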

Hardware and Optimization

  • Used H800 GPUs, which have the same compute as H100s but lower interconnect bandwidth

  • Wrote custom PTX (NVIDIA's low-level GPU instruction set) to optimize GPU usage:

    • Dedicated 20 of the 132 streaming multiprocessors (SMs) per GPU to managing cross-chip communication
    • This allowed them to approach H100-like performance on H800s despite the bandwidth cap
  • Achieved significant cost savings (a claimed 40x reduction) through a combination of optimizations:

    • Removing the value network
    • Custom low-precision formats
    • GPU-level optimizations
    • Mixture-of-experts routing (sketched below)
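
The mixture-of-experts idea is easy to see in miniature. Below is a minimal top-k routed layer, assuming PyTorch; the expert count, layer sizes, and plain linear experts are illustrative choices, not DeepSeek's actual architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    # Minimal top-k mixture-of-experts layer: a router picks k experts per
    # token, so only a fraction of the parameters is active on any forward pass.
    def __init__(self, dim: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gates = F.softmax(self.router(x), dim=-1)   # (tokens, n_experts)
        weights, idx = gates.topk(self.k, dim=-1)   # keep only the top-k experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e            # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

moe = TopKMoE(dim=16)
print(moe(torch.randn(4, 16)).shape)   # torch.Size([4, 16])
```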

Training Data and Process

  • Used math datasets such as NuminaMath-TIR (from the AIMO / Project Numina effort) for training (see the loading sketch after this list)
  • Employed reinforcement learning techniques to improve reasoning capabilities
  • Focused on allowing longer reasoning traces before producing final answers
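
A sketch of loading that dataset with the Hugging Face datasets library, assuming the public Hub release under AI-MO/NuminaMath-TIR with a "problem" field; check the Hub page if the schema has changed:

```python
from datasets import load_dataset

# Load the NuminaMath-TIR math dataset (tool-integrated reasoning traces)
# from the Hugging Face Hub.
ds = load_dataset("AI-MO/NuminaMath-TIR", split="train")
print(ds[0]["problem"])   # one math problem statement
```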

Performance and Capabilities

  • Achieved strong performance on math and reasoning tasks
  • Demonstrated ability to perform multi-step reasoning
  • Showed that RLHF-style reinforcement learning can improve on supervised fine-tuning alone

DeepSeek's success came from a combination of algorithmic innovations, hardware optimizations, and efficient training techniques, rather than any single breakthrough.
