Run DeepSeek R1 Dynamic 1.58-bit with Unsloth

Introducing DeepSeek R1: A New Era in Open-Source AI

Version 1.58-bit Dynamic
Released: January 27, 2025
Author: Andrew

DeepSeek R1 is making headlines by challenging OpenAI's o1 reasoning model while being completely open-source. The Unsloth team has made it more accessible for local users by reducing the model's size from 720GB to 131GB, an impressive 80% reduction, without compromising its functionality.

By analyzing DeepSeek R1's architecture, the Unsloth team was able to selectively quantize certain layers to higher bits (such as 4-bit) while keeping most MoE layers at 1.5-bit. This prevents failure modes such as endless loops and nonsensical output, which appear when all layers are naively quantized.
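As a rough illustration of this layer-selective (dynamic) approach, the sketch below assigns a bit-width per tensor based on its name. The name patterns and bit-width choices here are illustrative assumptions, not Unsloth's actual quantization recipe.

```python
# Illustrative sketch of dynamic (layer-selective) quantization choices.
# The name patterns and bit-widths are assumptions for illustration only;
# they are not Unsloth's exact recipe.
import re

def choose_bits(tensor_name: str) -> float:
    """Keep sensitive tensors at higher precision; push the bulk of the
    routed MoE expert weights down to ~1.5 bits."""
    if "embed" in tensor_name or "lm_head" in tensor_name:
        return 4.0   # embeddings and output head stay at higher precision
    m = re.search(r"layers\.(\d+)\.", tensor_name)
    if m and int(m.group(1)) < 3:
        return 4.0   # the first three layers are dense, keep them higher-bit
    if "self_attn" in tensor_name or "shared_expert" in tensor_name:
        return 4.0   # attention and shared-expert weights
    if ".experts." in tensor_name:
        return 1.5   # routed MoE experts: the vast majority of the parameters
    return 2.0       # everything else gets a middle ground

for name in [
    "model.layers.0.mlp.down_proj",            # dense layer
    "model.layers.10.self_attn.q_proj",        # attention
    "model.layers.10.mlp.experts.7.down_proj", # routed expert
]:
    print(f"{name}: {choose_bits(name)}-bit")
```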

For peak throughput of around 140 tokens per second, the 1.58-bit version needs roughly 160GB of VRAM. It will also run with as little as 20GB of RAM, albeit much more slowly; for usable speeds, a combined VRAM plus RAM of at least 80GB is recommended.
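As a rough way to reason about these numbers, the sketch below estimates how many transformer layers can be offloaded to GPU memory for a given quant size. The 61-layer count and the proportional-split heuristic are assumptions for illustration, not an official Unsloth formula.

```python
# Back-of-the-envelope heuristic for how many layers to offload to GPU(s)
# when running a GGUF. Assumes memory use scales roughly with the number of
# offloaded layers; this is an illustrative estimate, not an official formula.

def estimate_gpu_layers(vram_gb: float, file_size_gb: float,
                        total_layers: int = 61) -> int:
    """Return an approximate n_gpu_layers (-ngl) value, leaving ~10% of VRAM
    as headroom for the KV cache and scratch buffers."""
    usable = vram_gb * 0.9
    layers = int(total_layers * usable / file_size_gb)
    return max(0, min(total_layers, layers))

# Example: the 131GB 1.58-bit quant
print(estimate_gpu_layers(vram_gb=24, file_size_gb=131))   # -> 10 layers
print(estimate_gpu_layers(vram_gb=160, file_size_gb=131))  # -> 61 (everything fits)
```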

They uploaded several dynamic quantized versions, ranging from 131GB to 212GB, to Hugging Face.

Dynamic Quantized Versions

They offer four dynamic quantized versions. The first three use an importance matrix to optimize the quantization process, while the last is a general 2-bit quantization without calibration.

| MoE Bits | Disk Size | Type    | Quality | Link |
|----------|-----------|---------|---------|------|
| 1.58-bit | 131GB     | IQ1_S   | Fair    | Link |
| 1.73-bit | 158GB     | IQ1_M   | Good    | Link |
| 2.22-bit | 183GB     | IQ2_XXS | Better  | Link |
| 2.51-bit | 212GB     | Q2_K_XL | Best    | Link |

These versions are compatible with both distilled and non-distilled models, though hardware requirements may vary.
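To fetch one of these quants locally, a minimal sketch using the huggingface_hub library is shown below. The repository ID and folder pattern are taken from Unsloth's Hugging Face release and should be verified against the actual model page.

```python
# Minimal sketch: download the 1.58-bit dynamic quant with huggingface_hub
# (pip install huggingface_hub). Verify repo_id and the folder pattern against
# the Hugging Face listing before running.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/DeepSeek-R1-GGUF",
    local_dir="DeepSeek-R1-GGUF",
    # Only fetch the 1.58-bit (IQ1_S) shards, ~131GB in total.
    allow_patterns=["*UD-IQ1_S*"],
)
```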

Benchmarks and Testing

Instead of standard benchmarks, Unsloth tasked DeepSeek R1 with creating a Flappy Bird game, evaluating it on ten criteria. The dynamic 1.58-bit version consistently produced valid outputs, unlike naive quantizations that resulted in repetitive or incorrect outputs.

Exploiting DeepSeek R1’s Architecture

Their analysis revealed that the first three layers of DeepSeek R1 are dense rather than MoE. MoE layers allow the model to have far more parameters without a matching increase in compute, because each token is routed to only a few experts and the rest are skipped. By combining several quantization techniques, the Unsloth team kept precision where it is needed most while shrinking everything else.
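To make the MoE idea concrete, here is a small, self-contained sketch of top-k expert routing in NumPy. The dimensions and k value are illustrative and do not reflect DeepSeek R1's actual configuration.

```python
# Toy illustration of MoE routing: each token is sent to only its top-k
# experts, so most expert parameters are untouched for any given token.
# Sizes and k are illustrative, not DeepSeek R1's real configuration.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

router_w = rng.normal(size=(d_model, n_experts))           # router projection
expert_w = rng.normal(size=(n_experts, d_model, d_model))  # one weight per expert

def moe_layer(x: np.ndarray) -> np.ndarray:
    """x: (n_tokens, d_model) -> (n_tokens, d_model)."""
    logits = x @ router_w                                   # (n_tokens, n_experts)
    topk = np.argsort(logits, axis=-1)[:, -top_k:]          # chosen expert indices
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        sel = logits[t, topk[t]]
        gates = np.exp(sel - sel.max()); gates /= gates.sum()  # softmax over top-k
        for gate, e in zip(gates, topk[t]):
            out[t] += gate * (x[t] @ expert_w[e])            # only k of 8 experts run
    return out

tokens = rng.normal(size=(4, d_model))
print(moe_layer(tokens).shape)  # (4, 16)
```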

Running Dynamic Quants

You don't need a new version of llama.cpp to run dynamic quants. Any system capable of running GGUFs can handle them, though performance may vary based on available VRAM and RAM.
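For example, a minimal sketch using the llama-cpp-python bindings might look like the following. The GGUF path is the first shard of the 1.58-bit quant as Unsloth names it (llama.cpp picks up the remaining shards automatically), and the n_gpu_layers and n_threads values are placeholders to tune for your hardware.

```python
# Minimal sketch with the llama-cpp-python bindings (pip install llama-cpp-python).
# Verify the shard file name against your local download; it follows Unsloth's
# naming for the 1.58-bit quant. Tune n_gpu_layers / n_threads for your machine.
from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/"
               "DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf",
    n_ctx=4096,        # context window
    n_gpu_layers=10,   # how many layers to offload to VRAM (0 = CPU only)
    n_threads=16,      # CPU threads for the layers left on the CPU
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```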

For detailed instructions on running the model and more insights, see Unsloth's documentation.

Support Us

If you appreciate Unsloth's work, please consider starring them on GitHub. Your support means a lot! 💖
