See our collection for versions of DeepSeek-R1 including GGUF & 4-bit formats. Unsloth's DeepSeek-R1 1.58-bit + 2-bit Dynamic Quants selectively avoid quantizing certain parameters, greatly increasing accuracy over standard 1-bit/2-bit quantization.
Instructions to run this model in llama.cpp are below, or you can view more detailed instructions here: unsloth.ai/blog/deepseekr1-dynamic
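Before running, download the GGUF shards locally. Here is a minimal sketch with huggingface_hub (the repo id unsloth/DeepSeek-R1-GGUF, the file pattern, and local_dir are assumptions based on our collection; swap the pattern for the quant you want and point --model at wherever the shards land):

# Download only the 1.58-bit (IQ1_S) shards. Repo id, pattern and local_dir
# are assumptions - adjust them, and the --model path below, as needed.
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id = "unsloth/DeepSeek-R1-GGUF",
    local_dir = "DeepSeek-R1-GGUF",
    allow_patterns = ["*UD-IQ1_S*"],   # e.g. "*UD-IQ2_XXS*" for the 2.22-bit version
)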
- Do not forget the <|User|> and <|Assistant|> tokens! - Or use a chat template formatter (see the sketch after the example command below).
- Obtain the latest llama.cpp at https://github.com/ggerganov/llama.cpp
- It's best to use --min-p 0.05 or 0.1 to counteract very rare token predictions - I found this to work well, especially for the 1.58-bit model.
- The example below uses a Q4_0 quantized K cache. Note that -no-cnv disables auto conversation mode.
./llama.cpp/llama-cli \
--model DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
--cache-type-k q4_0 \
--threads 12 -no-cnv --prio 2 \
--temp 0.6 \
--ctx-size 8192 \
--seed 3407 \
--prompt "<|User|>Create a Flappy Bird game in Python.<|Assistant|>"
Example output:
Okay, so I need to figure out what 1 plus 1 is. Hmm, where do I even start? I remember from school that adding numbers is pretty basic, but I want to make sure I understand it properly. Let me think, 1 plus 1. So, I have one item and I add another one. Maybe like a apple plus another apple. If I have one apple and someone gives me another, I now have two apples. So, 1 plus 1 should be 2. That makes sense. Wait, but sometimes math can be tricky. Could it be something else? Like, in a different number system maybe? But I think the question is straightforward, using regular numbers, not like binary or hexadecimal or anything. I also recall that in arithmetic, addition is combining quantities. So, if you have two quantities of 1, combining them gives you a total of 2. Yeah, that seems right. Is there a scenario where 1 plus 1 wouldn't be 2? I can't think of any...

If you have a GPU (an RTX 4090, for example) with 24GB of VRAM, you can offload multiple layers to the GPU for faster processing. If you have multiple GPUs, you can probably offload more layers.
./llama.cpp/llama-cli \
--model DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
--cache-type-k q4_0 \
--threads 12 -no-cnv --prio 2 \
--n-gpu-layers 7 \
--temp 0.6 \
--ctx-size 8192 \
--seed 3407 \
--prompt "<|User|>Create a Flappy Bird game in Python.<|Assistant|>"
If you want to merge the split GGUF files into a single file, use llama-gguf-split:
./llama.cpp/llama-gguf-split --merge \
DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
merged_file.gguf
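After merging, point --model at merged_file.gguf. Merging is optional: recent llama.cpp builds load split GGUF files directly when you pass the first shard, which is exactly what the example commands above do.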
| MoE Bits | Type | Disk Size | Accuracy | Link | Details |
| -------- | ---- | --------- | -------- | ---- | ------- |
| 1.58bit | IQ1_S | 131GB | Fair | Link | MoE all 1.56bit. down_proj in MoE mixture of 2.06/1.56bit |
| 1.73bit | IQ1_M | 158GB | Good | Link | MoE all 1.56bit. down_proj in MoE left at 2.06bit |
| 2.22bit | IQ2_XXS | 183GB | Better | Link | MoE all 2.06bit. down_proj in MoE mixture of 2.5/2.06bit |
| 2.51bit | Q2_K_XL | 212GB | Best | Link | MoE all 2.5bit. down_proj in MoE mixture of 3.5/2.5bit |

Finetune LLMs 2-5x faster with 70% less memory via Unsloth! We have a free Google Colab Tesla T4 notebook for Llama 3.1 (8B) here: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-Alpaca.ipynb
✨ Finetune for Free: All notebooks are beginner-friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF or vLLM, or uploaded to Hugging Face.
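As a flavor of what the notebooks do, here is a minimal LoRA finetuning sketch with Unsloth. The model name, toy inline dataset, and hyperparameters are illustrative assumptions, and the SFTTrainer arguments follow the trl version used in the notebooks; the Colab notebooks above are the reference recipes.

# Minimal Unsloth LoRA finetuning sketch. Model name, dataset and
# hyperparameters are illustrative assumptions - see the notebooks above.
from unsloth import FastLanguageModel
from datasets import Dataset
from trl import SFTTrainer
from transformers import TrainingArguments

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct",  # assumed model; pick any supported one
    max_seq_length = 2048,
    load_in_4bit = True,
)

# Attach LoRA adapters so only a small fraction of weights are trained.
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
)

# Replace this toy dataset with your own; it just needs a "text" column.
dataset = Dataset.from_dict({"text": [
    "### Instruction:\nSay hello.\n\n### Response:\nHello there!",
]})

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = 2048,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        max_steps = 60,
        learning_rate = 2e-4,
        output_dir = "outputs",
    ),
)
trainer.train()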
| Unsloth supports | Free Notebooks | Performance | Memory use |
| ---------------- | -------------- | ----------- | ---------- |
| Llama-3.2 (3B) | | | |
- This Llama 3.2 conversational notebook is useful for ShareGPT ChatML / Vicuna templates.
- This text completion notebook is for raw text.
- This DPO notebook replicates Zephyr.
- Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.

Special Thanks