Introducing DeepSeek R1: A New Era in Open-Source AI
Version 1.58-bit Dynamic
Released: January 27, 2025
Author: Andrew
DeepSeek R1 is making headlines by challenging OpenAI's o1 reasoning model while being completely open source. The Unsloth team has made it far more accessible for local use by reducing the model's size from 720GB to 131GB, an impressive 80% reduction, without compromising its functionality.
By analyzing DeepSeek R1's structure, the Unsloth team was able to selectively quantize certain layers to higher bits (like 4-bit) while keeping most MoE layers at 1.5-bit. This approach prevents failure modes such as endless loops and nonsensical output, which occur if all layers are naively quantized.
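As a rough illustration of the idea (not Unsloth's actual code), the sketch below picks a llama.cpp-style quantization type per tensor based on its name, keeping embeddings, attention, and shared layers at higher precision while pushing the routed MoE expert weights down to the lowest-bit class. The tensor name patterns are assumptions for illustration only.

```python
# Illustrative sketch: choose a quantization type per tensor based on its name,
# mirroring the idea of keeping sensitive layers at higher precision while the
# MoE expert weights drop to the ~1.5-bit class. Name patterns are assumptions,
# not Unsloth's actual selection logic.

def choose_quant_type(tensor_name: str) -> str:
    """Return a llama.cpp-style quant type for a given tensor name."""
    # Embeddings and the output head are kept at higher precision.
    if "token_embd" in tensor_name or "output" in tensor_name:
        return "Q4_K"
    # Attention and shared/dense layers also stay at ~4-bit precision.
    if "attn" in tensor_name or "shared_expert" in tensor_name:
        return "Q4_K"
    # The bulk of the parameters live in the routed MoE experts; these are
    # pushed down to the lowest-bit IQ1_S format.
    if "exps" in tensor_name or "expert" in tensor_name:
        return "IQ1_S"
    # Everything else falls back to a mid-range 2-bit format.
    return "IQ2_XXS"


if __name__ == "__main__":
    for name in ["blk.0.attn_q.weight", "blk.10.ffn_down_exps.weight", "token_embd.weight"]:
        print(name, "->", choose_quant_type(name))
```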
For optimal performance, the 1.58-bit version benefits from around 160GB of VRAM (for example, two 80GB GPUs), where it reaches roughly 140 tokens per second of throughput. It will also run with as little as 20GB of system RAM, albeit much more slowly. For usable speeds, a combined total of at least 80GB of VRAM and RAM is recommended.
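As a rough guide, you can estimate how many of the model's layers to offload to the GPU from the file size and your available VRAM. The helper below is a back-of-the-envelope sketch, not an official sizing rule; the 61-layer count matches DeepSeek R1's architecture, and the proportional split is an assumption.

```python
# Rough heuristic for deciding how many layers to offload to GPU, assuming the
# layers are roughly equal in size. This is a simplification for illustration,
# not an official recommendation.

def estimate_gpu_layers(vram_gb: float, file_size_gb: float = 131.0, total_layers: int = 61) -> int:
    """Estimate how many of the model's layers fit in the given amount of VRAM."""
    if vram_gb <= 0:
        return 0
    layers = int(vram_gb / file_size_gb * total_layers)
    return max(0, min(layers, total_layers))


print(estimate_gpu_layers(24))   # a single 24GB GPU -> a handful of layers
print(estimate_gpu_layers(160))  # 2x 80GB GPUs -> the whole model fits
```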
They have uploaded the dynamically quantized versions, ranging from 131GB to 212GB, to Hugging Face.
Dynamic Quantized Versions
They offer four dynamically quantized versions. The first three use an importance matrix (imatrix) to calibrate the quantization and allow lower-bit representations, while the last is a general 2-bit quantization with no calibration.
| MoE Bits | Disk Size | Type | Quality | Link |
|---|---|---|---|---|
| 1.58-bit | 131GB | IQ1_S | Fair | Link |
| 1.73-bit | 158GB | IQ1_M | Good | Link |
| 2.22-bit | 183GB | IQ2_XXS | Better | Link |
| 2.51-bit | 212GB | Q2_K_XL | Best | Link |
These versions are compatible with both distilled and non-distilled models, though hardware requirements may vary.
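To fetch one specific version without pulling the whole repository, you can filter by folder name with `huggingface_hub`. The repo id and pattern below reflect the names Unsloth uses on Hugging Face at the time of writing; double-check the model page in case they change.

```python
# Minimal download sketch using huggingface_hub: grab only the 131GB
# 1.58-bit dynamic quant instead of every version in the repository.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/DeepSeek-R1-GGUF",   # Unsloth's R1 GGUF repository
    local_dir="DeepSeek-R1-GGUF",         # where to place the files locally
    allow_patterns=["*UD-IQ1_S*"],        # only the 1.58-bit dynamic quant folder
)
```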
Benchmarks and Testing
Rather than relying on standard benchmarks, Unsloth tasked DeepSeek R1 with creating a Flappy Bird game and scored the results against ten criteria. The dynamic 1.58-bit version consistently produced valid outputs, unlike naive quantizations, which led to repetitive or incorrect results.
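A scoring harness for this kind of test can be as simple as a pass/fail checklist. The sketch below is purely illustrative; the criteria names are placeholders, not Unsloth's exact rubric.

```python
# Hypothetical checklist scorer: each generated Flappy Bird program is checked
# against ten criteria and scored as the fraction it satisfies. The criteria
# names are placeholders for illustration.

CRITERIA = [
    "runs without errors", "bird renders", "pipes render", "gravity applied",
    "jump on keypress", "collision detection", "score tracking",
    "game-over state", "restart works", "random pipe gaps",
]

def score_attempt(passed: set[str]) -> float:
    """Return the fraction of criteria satisfied by one generated program."""
    return sum(c in passed for c in CRITERIA) / len(CRITERIA)

print(score_attempt({"runs without errors", "bird renders", "gravity applied"}))  # 0.3
```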
Exploiting DeepSeek R1’s Architecture
Unsloth's analysis revealed that the first three layers of DeepSeek R1 are dense, not MoE. MoE layers add parameters without increasing compute per token, because the router activates only a few experts for each token and the remaining, effectively zeroed-out entries are skipped. By combining several quantization techniques, they reduced the model's size while maintaining precision where it is needed most.
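To see why MoE layers add parameters without adding much compute, here is a toy top-k router in NumPy (not DeepSeek's actual routing code): each token only runs through the few experts the router selects, and every other expert is skipped entirely.

```python
# Toy sketch of MoE routing: only the top-k experts compute anything for a
# given token, so parameter count can grow without a matching growth in
# per-token compute. Shapes and router are simplified assumptions.
import numpy as np

def moe_forward(x, expert_weights, router_weights, top_k=2):
    """x: (hidden,), expert_weights: (n_experts, hidden, hidden), router_weights: (n_experts, hidden)."""
    scores = router_weights @ x                               # one routing score per expert
    top = np.argsort(scores)[-top_k:]                         # indices of the top-k experts
    gates = np.exp(scores[top]) / np.exp(scores[top]).sum()   # softmax over the selected experts only
    # Only the selected experts do any work; the rest are skipped entirely.
    return sum(g * (expert_weights[i] @ x) for g, i in zip(gates, top))

rng = np.random.default_rng(0)
hidden, n_experts = 8, 16
y = moe_forward(rng.normal(size=hidden),
                rng.normal(size=(n_experts, hidden, hidden)),
                rng.normal(size=(n_experts, hidden)))
print(y.shape)  # (8,)
```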
Running Dynamic Quants
You don't need a new version of llama.cpp to run dynamic quants. Any system capable of running GGUFs can handle them, though performance may vary based on available VRAM and RAM.
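For example, with the `llama-cpp-python` bindings you can load the 1.58-bit files directly. The file path, layer offload count, and context size below are illustrative assumptions; for multi-part GGUFs, point at the first shard and the remaining parts are picked up automatically.

```python
# Minimal sketch of loading the 1.58-bit GGUF with llama-cpp-python.
# Path, n_gpu_layers, and n_ctx are illustrative; adjust to your hardware.
from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf",
    n_gpu_layers=7,    # offload as many layers as your VRAM allows (0 = CPU only)
    n_ctx=4096,        # context window; larger values need more memory
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write Flappy Bird in Python using pygame."}],
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])
```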
For detailed instructions on running the model and more insights, see the Unsloth documentation.
Support Us
If you appreciate our work, please consider starring us on GitHub. Your support means a lot! 💖