DeepSeek-R1 GGUF (Unsloth)

See our collection for versions of DeepSeek-R1, including GGUF & 4-bit formats. Unsloth's DeepSeek-R1 1.58-bit + 2-bit Dynamic Quants selectively avoid quantizing certain parameters, greatly increasing accuracy over standard 1-bit/2-bit quantization.

Instructions to run this model in llama.cpp are below; you can also view more detailed instructions at unsloth.ai/blog/deepseekr1-dynamic.
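Before running the commands below you need the split GGUF files locally. Here is a minimal download sketch using huggingface_hub - the repo id unsloth/DeepSeek-R1-GGUF and the *UD-IQ1_S* filename pattern are assumptions on my part, so check the collection for the exact names:

    # Sketch: fetch only the 1.58-bit (IQ1_S) splits of the dynamic quant.
    # Repo id and filename pattern are assumptions - verify them on the Hub.
    from huggingface_hub import snapshot_download

    snapshot_download(
        repo_id="unsloth/DeepSeek-R1-GGUF",   # assumed repo id
        local_dir="DeepSeek-R1-GGUF",
        allow_patterns=["*UD-IQ1_S*"],        # only the 1.58-bit split files
    )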

  • Do not forget about the <|User|> and <|Assistant|> tokens! Or use a chat template formatter.
  • Obtain the latest llama.cpp at https://github.com/ggerganov/llama.cpp
  • It's best to use --min-p 0.05 or 0.1 to counteract very rare token predictions - I found this to work well, especially for the 1.58-bit model.
  • The example below uses a Q4_0 K-quantized cache. Notice that -no-cnv disables auto conversation mode.

   ./llama.cpp/llama-cli \
      --model DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
      --cache-type-k q4_0 \
      --threads 12 -no-cnv --prio 2 \
      --temp 0.6 \
      --ctx-size 8192 \
      --seed 3407 \
      --prompt "<|User|>Create a Flappy Bird game in Python.<|Assistant|>"

Example output (this transcript is from a simple "what is 1 plus 1" style prompt, not the Flappy Bird prompt above):

Okay, so I need to figure out what 1 plus 1 is. Hmm, where do I even start? I remember from school that adding numbers is pretty basic, but I want to make sure I understand it properly. Let me think, 1 plus 1. So, I have one item and I add another one. Maybe like a apple plus another apple. If I have one apple and someone gives me another, I now have two apples. So, 1 plus 1 should be 2. That makes sense. Wait, but sometimes math can be tricky. Could it be something else? Like, in a different number system maybe? But I think the question is straightforward, using regular numbers, not like binary or hexadecimal or anything. I also recall that in arithmetic, addition is combining quantities. So, if you have two quantities of 1, combining them gives you a total of 2. Yeah, that seems right. Is there a scenario where 1 plus 1 wouldn't be 2? I can't think of any...

If you have a GPU with 24GB of VRAM (an RTX 4090, for example), you can offload several layers to the GPU for faster processing. If you have multiple GPUs, you can probably offload more layers. A rough way to estimate how many layers fit is sketched after this example.

   ./llama.cpp/llama-cli \
      --model DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
      --cache-type-k q4_0 \
      --threads 12 -no-cnv --prio 2 \
      --n-gpu-layers 7 \
      --temp 0.6 \
      --ctx-size 8192 \
      --seed 3407 \
      --prompt "<|User|>Create a Flappy Bird game in Python.<|Assistant|>"
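
As a rough back-of-envelope estimate (my own heuristic, not something from the llama.cpp docs): the number of layers you can offload scales with the ratio of available VRAM to the total GGUF size, and DeepSeek-R1 has 61 transformer layers:

    # Heuristic sketch (assumption): offloadable layers ~ VRAM / model size * total layers.
    vram_gb = 24          # e.g. a single RTX 4090
    model_size_gb = 131   # disk size of the 1.58-bit IQ1_S quant (see table below)
    total_layers = 61     # DeepSeek-R1 transformer layers

    n_gpu_layers = int(vram_gb / model_size_gb * total_layers)
    print(n_gpu_layers)   # ~11 here; pass a value like this to --n-gpu-layers

The example above conservatively uses 7; start lower if you hit out-of-memory errors and increase until generation speed stops improving.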

If you want to merge the weights together, use this script:

   ./llama.cpp/llama-gguf-split --merge \
    DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    merged_file.gguf

| MoE Bits | Type | Disk Size | Accuracy | Link | Details |
|---|---|---|---|---|---|
| 1.58bit | IQ1_S | 131GB | Fair | Link | MoE all 1.56bit. down_proj in MoE mixture of 2.06/1.56bit |
| 1.73bit | IQ1_M | 158GB | Good | Link | MoE all 1.56bit. down_proj in MoE left at 2.06bit |
| 2.22bit | IQ2_XXS | 183GB | Better | Link | MoE all 2.06bit. down_proj in MoE mixture of 2.5/2.06bit |
| 2.51bit | Q2_K_XL | 212GB | Best | Link | MoE all 2.5bit. down_proj in MoE mixture of 3.5/2.5bit |

Finetune LLMs 2-5x faster with 70% less memory via Unsloth! We have a free Google Colab Tesla T4 notebook for Llama 3.1 (8B) here: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-Alpaca.ipynb

✨ Finetune for Free: all notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM, or uploaded to Hugging Face.
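
For reference, the core of those notebooks is roughly the setup below; a minimal sketch assuming the unsloth package and its pre-quantized unsloth/Meta-Llama-3.1-8B-bnb-4bit checkpoint (the linked Colab contains the full, tested version):

    # Minimal QLoRA setup sketch with Unsloth (checkpoint name is an assumption).
    from unsloth import FastLanguageModel

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",  # assumed 4-bit checkpoint
        max_seq_length=2048,
        load_in_4bit=True,
    )

    # Attach LoRA adapters - only these small matrices are trained.
    model = FastLanguageModel.get_peft_model(
        model,
        r=16,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
        lora_alpha=16,
    )
    # ...then train with TRL's SFTTrainer on your dataset and export to GGUF.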

| Unsloth supports | Free Notebooks | Performance | Memory use |
|---|---|---|---|
| Llama-3.2 (3B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.2 (11B vision) | ▶️ Start on Colab | 2x faster | 60% less |
| Qwen2 VL (7B) | ▶️ Start on Colab | 1.8x faster | 60% less |
| Qwen2.5 (7B) | ▶️ Start on Colab | 2x faster | 60% less |
| Llama-3.1 (8B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Phi-3.5 (mini) | ▶️ Start on Colab | 2x faster | 50% less |
| Gemma 2 (9B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Mistral (7B) | ▶️ Start on Colab | 2.2x faster | 62% less |

This Llama 3.2 conversational notebook is useful for ShareGPT ChatML / Vicuna templates. This text completion notebook is for raw text. This DPO notebook replicates Zephyr.

  • Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.
Comment from the author (@bigsnarfdude): running the 3-bit MLX community quant of DeepSeek-R1 with mlx-lm:
pip install mlx-lm

# Load the 3-bit MLX quant of DeepSeek-R1 and generate from a chat prompt.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/DeepSeek-R1-3bit")

prompt = "hello"

# Wrap the prompt in the model's chat template (adds the <|User|>/<|Assistant|> tokens).
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)
