Introducing DeepSeek R1: A New Era in Open-Source AI
Version 1.58-bit Dynamic
Released: January 27, 2025
Author: Andrew
DeepSeek R1 is making headlines by challenging OpenAI's o1 reasoning model while being completely open source. The Unsloth team has made it far more accessible for local use by reducing the model's size from 720GB to 131GB, an impressive 80% reduction, without compromising its functionality.
By analyzing DeepSeek R1's structure, the Unsloth team was able to selectively quantize certain layers to higher bits (like 4-bit) while keeping most MoE layers at 1.5-bit. This approach prevents failure modes such as endless loops and nonsensical output, which occur if all layers are naively quantized.
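As a rough illustration of the idea (not Unsloth's actual code), the sketch below picks a llama.cpp-style quantization type per tensor based on its name, keeping embeddings, attention, and shared layers at higher precision while pushing the routed MoE expert weights down to the lowest-bit class. The tensor name patterns are assumptions for illustration only.

```python
# Illustrative sketch: choose a quantization type per tensor based on its name,
# mirroring the idea of keeping sensitive layers at higher precision while the
# MoE expert weights drop to the ~1.5-bit class. Name patterns are assumptions,
# not Unsloth's actual selection logic.

def choose_quant_type(tensor_name: str) -> str:
    """Return a llama.cpp-style quant type for a given tensor name."""
    # Embeddings and the output head are kept at higher precision.
    if "token_embd" in tensor_name or "output" in tensor_name:
        return "Q4_K"
    # Attention and shared/dense layers also stay at ~4-bit precision.
    if "attn" in tensor_name or "shared_expert" in tensor_name:
        return "Q4_K"
    # The bulk of the parameters live in the routed MoE experts; these are
    # pushed down to the lowest-bit IQ1_S format.
    if "exps" in tensor_name or "expert" in tensor_name:
        return "IQ1_S"
    # Everything else falls back to a mid-range 2-bit format.
    return "IQ2_XXS"


if __name__ == "__main__":
    for name in ["blk.0.attn_q.weight", "blk.10.ffn_down_exps.weight", "token_embd.weight"]:
        print(name, "->", choose_quant_type(name))
```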
For optimal performance, the 1.58-bit version benefits from around 160GB of VRAM (for example, two 80GB GPUs), where it reaches roughly 140 tokens per second of throughput. It will also run with as little as 20GB of system RAM, albeit much more slowly. For usable speeds, a combined total of at least 80GB of VRAM and RAM is recommended.
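As a rough guide, you can estimate how many of the model's layers to offload to the GPU from the file size and your available VRAM. The helper below is a back-of-the-envelope sketch, not an official sizing rule; the 61-layer count matches DeepSeek R1's architecture, and the proportional split is an assumption.

```python
# Rough heuristic for deciding how many layers to offload to GPU, assuming the
# layers are roughly equal in size. This is a simplification for illustration,
# not an official recommendation.

def estimate_gpu_layers(vram_gb: float, file_size_gb: float = 131.0, total_layers: int = 61) -> int:
    """Estimate how many of the model's layers fit in the given amount of VRAM."""
    if vram_gb <= 0:
        return 0
    layers = int(vram_gb / file_size_gb * total_layers)
    return max(0, min(layers, total_layers))


print(estimate_gpu_layers(24))   # a single 24GB GPU -> a handful of layers
print(estimate_gpu_layers(160))  # 2x 80GB GPUs -> the whole model fits
```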
They have uploaded the dynamically quantized versions, ranging from 131GB to 212GB, to Hugging Face.
Dynamic Quantized Versions
They offer four dynamically quantized versions. The first three use an importance matrix (imatrix) to calibrate the quantization and allow lower-bit representations, while the last is a general 2-bit quantization with no calibration.
| MoE Bits | Disk Size | Type | Quality | Link |
|---|---|---|---|---|
| 1.58-bit | 131GB | IQ1_S | Fair | Link |
| 1.73-bit | 158GB | IQ1_M | Good | Link |
| 2.22-bit | 183GB | IQ2_XXS | Better | Link |
| 2.51-bit | 212GB | Q2_K_XL | Best | Link |
These versions are compatible with both distilled and non-distilled models, though hardware requirements may vary.
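To fetch one specific version without pulling the whole repository, you can filter by folder name with `huggingface_hub`. The repo id and pattern below reflect the names Unsloth uses on Hugging Face at the time of writing; double-check the model page in case they change.

```python
# Minimal download sketch using huggingface_hub: grab only the 131GB
# 1.58-bit dynamic quant instead of every version in the repository.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/DeepSeek-R1-GGUF",   # Unsloth's R1 GGUF repository
    local_dir="DeepSeek-R1-GGUF",         # where to place the files locally
    allow_patterns=["*UD-IQ1_S*"],        # only the 1.58-bit dynamic quant folder
)
```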
Benchmarks and Testing
Rather than relying on standard benchmarks, Unsloth tasked DeepSeek R1 with creating a Flappy Bird game and scored the results against ten criteria. The dynamic 1.58-bit version consistently produced valid outputs, unlike naive quantizations, which led to repetitive or incorrect results.
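A scoring harness for this kind of test can be as simple as a pass/fail checklist. The sketch below is purely illustrative; the criteria names are placeholders, not Unsloth's exact rubric.

```python
# Hypothetical checklist scorer: each generated Flappy Bird program is checked
# against ten criteria and scored as the fraction it satisfies. The criteria
# names are placeholders for illustration.

CRITERIA = [
    "runs without errors", "bird renders", "pipes render", "gravity applied",
    "jump on keypress", "collision detection", "score tracking",
    "game-over state", "restart works", "random pipe gaps",
]

def score_attempt(passed: set[str]) -> float:
    """Return the fraction of criteria satisfied by one generated program."""
    return sum(c in passed for c in CRITERIA) / len(CRITERIA)

print(score_attempt({"runs without errors", "bird renders", "gravity applied"}))  # 0.3
```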
Exploiting DeepSeek R1’s Architecture
Unsloth's analysis revealed that the first three layers of DeepSeek R1 are dense, not MoE. MoE layers add parameters without increasing compute per token, because the router activates only a few experts for each token and the remaining, effectively zeroed-out entries are skipped. By combining several quantization techniques, they reduced the model's size while maintaining precision where it is needed most.
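To see why MoE layers add parameters without adding much compute, here is a toy top-k router in NumPy (not DeepSeek's actual routing code): each token only runs through the few experts the router selects, and every other expert is skipped entirely.

```python
# Toy sketch of MoE routing: only the top-k experts compute anything for a
# given token, so parameter count can grow without a matching growth in
# per-token compute. Shapes and router are simplified assumptions.
import numpy as np

def moe_forward(x, expert_weights, router_weights, top_k=2):
    """x: (hidden,), expert_weights: (n_experts, hidden, hidden), router_weights: (n_experts, hidden)."""
    scores = router_weights @ x                               # one routing score per expert
    top = np.argsort(scores)[-top_k:]                         # indices of the top-k experts
    gates = np.exp(scores[top]) / np.exp(scores[top]).sum()   # softmax over the selected experts only
    # Only the selected experts do any work; the rest are skipped entirely.
    return sum(g * (expert_weights[i] @ x) for g, i in zip(gates, top))

rng = np.random.default_rng(0)
hidden, n_experts = 8, 16
y = moe_forward(rng.normal(size=hidden),
                rng.normal(size=(n_experts, hidden, hidden)),
                rng.normal(size=(n_experts, hidden)))
print(y.shape)  # (8,)
```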
Running Dynamic Quants
You don't need a new version of llama.cpp to run dynamic quants. Any system capable of running GGUFs can handle them, though performance may vary based on available VRAM and RAM.
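For example, with the `llama-cpp-python` bindings you can load the 1.58-bit files directly. The file path, layer offload count, and context size below are illustrative assumptions; for multi-part GGUFs, point at the first shard and the remaining parts are picked up automatically.

```python
# Minimal sketch of loading the 1.58-bit GGUF with llama-cpp-python.
# Path, n_gpu_layers, and n_ctx are illustrative; adjust to your hardware.
from llama_cpp import Llama

llm = Llama(
    model_path="DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf",
    n_gpu_layers=7,    # offload as many layers as your VRAM allows (0 = CPU only)
    n_ctx=4096,        # context window; larger values need more memory
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write Flappy Bird in Python using pygame."}],
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])
```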
For detailed instructions on running the model and more insights, see the Unsloth documentation.
Support Us
If you appreciate our work, please consider starring us on GitHub. Your support means a lot! 💖