DeepSeek-R1 GGUF (Unsloth)

See our collection for versions of DeepSeek-R1, including GGUF & 4-bit formats. Unsloth's DeepSeek-R1 1.58-bit + 2-bit Dynamic Quants selectively avoid quantizing certain parameters, greatly increasing accuracy over standard 1-bit/2-bit quantization.

Instructions to run this model in llama.cpp are below; you can also view more detailed instructions at unsloth.ai/blog/deepseekr1-dynamic.
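Before running the commands below you need the split GGUF files locally. Here is a minimal download sketch using huggingface_hub - the repo id unsloth/DeepSeek-R1-GGUF and the *UD-IQ1_S* filename pattern are assumptions on my part, so check the collection for the exact names:

    # Sketch: fetch only the 1.58-bit (IQ1_S) splits of the dynamic quant.
    # Repo id and filename pattern are assumptions - verify them on the Hub.
    from huggingface_hub import snapshot_download

    snapshot_download(
        repo_id="unsloth/DeepSeek-R1-GGUF",   # assumed repo id
        local_dir="DeepSeek-R1-GGUF",
        allow_patterns=["*UD-IQ1_S*"],        # only the 1.58-bit split files
    )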

  • Do not forget about the <|User|> and <|Assistant|> tokens! Or use a chat template formatter.
  • Obtain the latest llama.cpp at https://github.com/ggerganov/llama.cpp
  • It's best to use --min-p 0.05 or 0.1 to counteract very rare token predictions - I found this to work well, especially for the 1.58-bit model.
  • The example below uses a Q4_0 K-quantized cache. Notice that -no-cnv disables auto conversation mode.

   ./llama.cpp/llama-cli \
      --model DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
      --cache-type-k q4_0 \
      --threads 12 -no-cnv --prio 2 \
      --temp 0.6 \
      --ctx-size 8192 \
      --seed 3407 \
      --prompt "<|User|>Create a Flappy Bird game in Python.<|Assistant|>"

Example output (this transcript is from a simple "what is 1 plus 1" style prompt, not the Flappy Bird prompt above):

Okay, so I need to figure out what 1 plus 1 is. Hmm, where do I even start? I remember from school that adding numbers is pretty basic, but I want to make sure I understand it properly. Let me think, 1 plus 1. So, I have one item and I add another one. Maybe like a apple plus another apple. If I have one apple and someone gives me another, I now have two apples. So, 1 plus 1 should be 2. That makes sense. Wait, but sometimes math can be tricky. Could it be something else? Like, in a different number system maybe? But I think the question is straightforward, using regular numbers, not like binary or hexadecimal or anything. I also recall that in arithmetic, addition is combining quantities. So, if you have two quantities of 1, combining them gives you a total of 2. Yeah, that seems right. Is there a scenario where 1 plus 1 wouldn't be 2? I can't think of any...

If you have a GPU with 24GB of VRAM (an RTX 4090, for example), you can offload several layers to the GPU for faster processing. If you have multiple GPUs, you can probably offload more layers. A rough way to estimate how many layers fit is sketched after this example.

   ./llama.cpp/llama-cli \
      --model DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
      --cache-type-k q4_0 \
      --threads 12 -no-cnv --prio 2 \
      --n-gpu-layers 7 \
      --temp 0.6 \
      --ctx-size 8192 \
      --seed 3407 \
      --prompt "<|User|>Create a Flappy Bird game in Python.<|Assistant|>"
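
As a rough back-of-envelope estimate (my own heuristic, not something from the llama.cpp docs): the number of layers you can offload scales with the ratio of available VRAM to the total GGUF size, and DeepSeek-R1 has 61 transformer layers:

    # Heuristic sketch (assumption): offloadable layers ~ VRAM / model size * total layers.
    vram_gb = 24          # e.g. a single RTX 4090
    model_size_gb = 131   # disk size of the 1.58-bit IQ1_S quant (see table below)
    total_layers = 61     # DeepSeek-R1 transformer layers

    n_gpu_layers = int(vram_gb / model_size_gb * total_layers)
    print(n_gpu_layers)   # ~11 here; pass a value like this to --n-gpu-layers

The example above conservatively uses 7; start lower if you hit out-of-memory errors and increase until generation speed stops improving.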

If you want to merge the weights together, use this script:

   ./llama.cpp/llama-gguf-split --merge \
    DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
    merged_file.gguf

| MoE Bits | Type | Disk Size | Accuracy | Link | Details |
|---|---|---|---|---|---|
| 1.58bit | IQ1_S | 131GB | Fair | Link | MoE all 1.56bit. down_proj in MoE mixture of 2.06/1.56bit |
| 1.73bit | IQ1_M | 158GB | Good | Link | MoE all 1.56bit. down_proj in MoE left at 2.06bit |
| 2.22bit | IQ2_XXS | 183GB | Better | Link | MoE all 2.06bit. down_proj in MoE mixture of 2.5/2.06bit |
| 2.51bit | Q2_K_XL | 212GB | Best | Link | MoE all 2.5bit. down_proj in MoE mixture of 3.5/2.5bit |

Finetune LLMs 2-5x faster with 70% less memory via Unsloth! We have a free Google Colab Tesla T4 notebook for Llama 3.1 (8B) here: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-Alpaca.ipynb

✨ Finetune for Free: all notebooks are beginner friendly! Add your dataset, click "Run All", and you'll get a 2x faster finetuned model which can be exported to GGUF, vLLM, or uploaded to Hugging Face.
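
For reference, the core of those notebooks is roughly the setup below; a minimal sketch assuming the unsloth package and its pre-quantized unsloth/Meta-Llama-3.1-8B-bnb-4bit checkpoint (the linked Colab contains the full, tested version):

    # Minimal QLoRA setup sketch with Unsloth (checkpoint name is an assumption).
    from unsloth import FastLanguageModel

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",  # assumed 4-bit checkpoint
        max_seq_length=2048,
        load_in_4bit=True,
    )

    # Attach LoRA adapters - only these small matrices are trained.
    model = FastLanguageModel.get_peft_model(
        model,
        r=16,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
        lora_alpha=16,
    )
    # ...then train with TRL's SFTTrainer on your dataset and export to GGUF.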

| Unsloth supports | Free Notebooks | Performance | Memory use |
|---|---|---|---|
| Llama-3.2 (3B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Llama-3.2 (11B vision) | ▶️ Start on Colab | 2x faster | 60% less |
| Qwen2 VL (7B) | ▶️ Start on Colab | 1.8x faster | 60% less |
| Qwen2.5 (7B) | ▶️ Start on Colab | 2x faster | 60% less |
| Llama-3.1 (8B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Phi-3.5 (mini) | ▶️ Start on Colab | 2x faster | 50% less |
| Gemma 2 (9B) | ▶️ Start on Colab | 2.4x faster | 58% less |
| Mistral (7B) | ▶️ Start on Colab | 2.2x faster | 62% less |

This Llama 3.2 conversational notebook is useful for ShareGPT ChatML / Vicuna templates. This text completion notebook is for raw text. This DPO notebook replicates Zephyr.

  • Kaggle has 2x T4s, but we use 1. Due to overhead, 1x T4 is 5x faster.
Comment from the author (@bigsnarfdude): running the 3-bit MLX community quant of DeepSeek-R1 with mlx-lm:
pip install mlx-lm

# Load the 3-bit MLX quant of DeepSeek-R1 and generate from a chat prompt.
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/DeepSeek-R1-3bit")

prompt = "hello"

# Wrap the prompt in the model's chat template (adds the <|User|>/<|Assistant|> tokens).
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)
