Are any of these local sized models any good at code generation?
This is actually quite a complex question that needs to be broken down a bit.
They're a bit all over the place: for example, Claude 3.5 performs best on SWE-bench Verified with 50.8, but comes far behind DeepSeek R1 on LiveCodeBench (38.9 vs 65.9) and behind o1 on Codeforces (717 vs 2061).
There's an argument that chain of thought (CoT) models do better, because they argue with themselves a bit before giving their answer.
- https://www.ibm.com/think/topics/chain-of-thoughts
- LIMO: Less is More for Reasoning: https://arxiv.org/pdf/2502.03387
Let's say the frontrunners are Claude 3.5, o1, and R1.
The problem with proprietary models is that you can't really know their parameter count unless the vendor publishes it, which they usually don't.
This article suggests (without a source) that Claude 3.5 is a 175B-parameter model. If it's stored in 16-bit floats, you'd need at least 350GB of RAM just to load the weights into memory.
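As a back-of-the-envelope check (taking that unsourced 175B figure at face value, and counting only the weights, not the KV cache or any runtime overhead), the arithmetic looks like this:

```python
def weights_ram_gb(n_params: float, bits_per_param: int) -> float:
    """Rough RAM needed just to hold the weights in memory."""
    return n_params * bits_per_param / 8 / 1e9  # bits -> bytes -> GB

for bits in (16, 8, 4):
    print(f"175B params at {bits}-bit: ~{weights_ram_gb(175e9, bits):.0f} GB")
# 175B params at 16-bit: ~350 GB
# 175B params at 8-bit: ~175 GB
# 175B params at 4-bit: ~88 GB
```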
If you downsample a 16-bit float to an 8-bit or 4-bit representation you may or may not lose information, depending on what the original number was. According to research, compressing a model beyond 4 bits causes a deterioration in performance, but 8-bit quantisation is largely comparable to the uncompressed equivalent (A Comprehensive Evaluation of Quantization Strategies for Large Language Models).
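To make "may or may not lose information" concrete, here's a minimal numpy sketch of symmetric round-to-nearest quantisation. It's a toy per-tensor scheme, not what GPTQ, AWQ or llama.cpp actually do (those use per-channel or group-wise scales, among other tricks):

```python
import numpy as np

def quantize_roundtrip(w: np.ndarray, bits: int) -> np.ndarray:
    """Quantise to a signed integer grid and dequantise back to float."""
    qmax = 2 ** (bits - 1) - 1        # 127 for 8-bit, 7 for 4-bit
    scale = np.abs(w).max() / qmax    # map the largest weight onto qmax
    return np.round(w / scale).clip(-qmax, qmax) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(4096, 4096)).astype(np.float32)  # fake weight matrix

for bits in (8, 4):
    err = np.abs(quantize_roundtrip(w, bits) - w)
    print(f"{bits}-bit round-trip: mean abs error {err.mean():.2e}, max {err.max():.2e}")
```

In this scheme an 8-bit grid has 255 representable levels, so most weights barely move; a 4-bit grid has only 15, which is where the visible degradation starts.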
In another paper, How Does Quantization Affect Multilingual LLMs?, the authors conclude that "challenging tasks degrade fast and severely (e.g. mathematical reasoning and responses to realistic challenging prompts)" and that "damage from quantization is much worse than appears from automatic metrics: even when not observed automatically, human evaluators notice it".
Red Hat compared various quantisations of Llama 3.1 (down to int8; no 4-bit comparisons) and found that performance was comparable to the unquantised models.
I speculate that 8-bit quantisation is probably fine for long sessions, but that with 4-bit quantisation the precision loss will compound and result in poor responses.
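To illustrate that intuition (and nothing more), here's a toy numpy experiment that pushes one activation vector through a stack of random layers and measures how far the quantised-weight paths drift from the full-precision path. It isn't a faithful model of a long chat session, just of small per-layer errors compounding with depth:

```python
import numpy as np

def quantize(w: np.ndarray, bits: int) -> np.ndarray:
    """Same symmetric round-trip quantisation as above."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale).clip(-qmax, qmax) * scale

rng = np.random.default_rng(1)
depth, dim = 40, 512
layers = [rng.normal(0, dim ** -0.5, size=(dim, dim)) for _ in range(depth)]

x = rng.normal(size=dim)
exact = x.copy()
drifted = {8: x.copy(), 4: x.copy()}

for w in layers:
    exact = np.tanh(exact @ w)
    for bits in drifted:
        drifted[bits] = np.tanh(drifted[bits] @ quantize(w, bits))

for bits, y in drifted.items():
    rel = np.linalg.norm(y - exact) / np.linalg.norm(exact)
    print(f"{bits}-bit weights: relative drift after {depth} layers = {rel:.3f}")
```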
So with 8-bit quantisation I'd only need around 175GB of RAM to run Claude 3.5; that's starting to become attainable.
There's also dynamic quantisation, where only the weights are quantised and the intermediate results are kept in full precision; this could help with the precision collapse in long sessions I mentioned.
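PyTorch ships a helper that does exactly this. Here's a minimal sketch on a toy stand-in model; quantising a real LLM for local use would normally go through llama.cpp, GPTQ, AWQ or bitsandbytes instead:

```python
import torch
from torch import nn

# Toy stand-in for a transformer block's linear layers.
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

# Dynamic quantisation: weights are stored as int8, but the activations you
# pass between layers remain ordinary float tensors, quantised on the fly
# only for each matmul.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 1024)
print((model(x) - quantized(x)).abs().max())  # worst-case output difference
```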
In January, DeepSeek famously published DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. The R1 model (whose weights are freely available) is a state-of-the-art model that performs at the same level as Claude 3.5, GPT-4o, and o1-mini.
R1 is a 671B-parameter mixture-of-experts model, but only 37B parameters are activated for any given token.
DeepSeek used a technique called distillation, where a larger model is prompted to produce training data that is then used to fine-tune a smaller model. They produced several smaller models (70B, 32B, 14B, 8B, etc.) using Qwen and Llama as foundation models and R1 as the teacher that generated the distillation set. (It's widely accepted that they distilled o1 to create R1.)
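Conceptually the distillation loop is simple. Here's a heavily simplified sketch, where `call_teacher`, the stub output, and the prompt are placeholders for however you'd actually query the big model; DeepSeek's real pipeline (roughly 800k curated samples) is far more involved:

```python
def build_distillation_set(call_teacher, prompts):
    """Collect (prompt, teacher completion) pairs to fine-tune a student on.

    `call_teacher` stands in for however you query the large model (a hosted
    API, a local inference server, ...); it should return the teacher's full
    reasoning trace plus final answer as one string.
    """
    return [{"prompt": p, "completion": call_teacher(p)} for p in prompts]

# Toy demonstration with a stub teacher standing in for R1.
pairs = build_distillation_set(
    call_teacher=lambda p: "<think>2 and 2 make 4.</think> The answer is 4.",
    prompts=["What is 2 + 2?"],
)
print(pairs)

# Step two is plain supervised fine-tuning of a small base model (Qwen or
# Llama) on pairs like these; no reinforcement learning is needed, and that
# is what produces the R1-Distill family.
```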
The paper itself draws this conclusion about distillation:

> Therefore, we can draw two conclusions: First, distilling more powerful models into smaller ones yields excellent results, whereas smaller models relying on the large-scale RL mentioned in this paper require enormous computational power and may not even achieve the performance of distillation. Second, while distillation strategies are both economical and effective, advancing beyond the boundaries of intelligence may still require more powerful base models and larger-scale reinforcement learning.
Model | AIME 2024 (pass@1) | AIME 2024 (cons@64) | MATH-500 (pass@1) | GPQA Diamond (pass@1) | LiveCodeBench (pass@1) | CodeForces (rating) |
---|---|---|---|---|---|---|
GPT-4o-0513 | 9.3 | 13.4 | 74.6 | 49.9 | 32.9 | 759 |
Claude-3.5-Sonnet-1022 | 16.0 | 26.7 | 78.3 | 65.0 | 38.9 | 717 |
OpenAI-o1-mini | 63.6 | 80.0 | 90.0 | 60.0 | 53.8 | 1820 |
QwQ-32B-Preview | 50.0 | 60.0 | 90.6 | 54.5 | 41.9 | 1316 |
DeepSeek-R1-Distill-Qwen-1.5B | 28.9 | 52.7 | 83.9 | 33.8 | 16.9 | 954 |
DeepSeek-R1-Distill-Qwen-7B | 55.5 | 83.3 | 92.8 | 49.1 | 37.6 | 1189 |
DeepSeek-R1-Distill-Qwen-14B | 69.7 | 80.0 | 93.9 | 59.1 | 53.1 | 1481 |
DeepSeek-R1-Distill-Qwen-32B | 72.6 | 83.3 | 94.3 | 62.1 | 57.2 | 1691 |
DeepSeek-R1-Distill-Llama-8B | 50.4 | 80.0 | 89.1 | 49.0 | 39.6 | 1205 |
DeepSeek-R1-Distill-Llama-70B | 70.0 | 86.7 | 94.5 | 65.2 | 57.5 | 1633 |
In these benchmarks, the Qwen and Llama models distilled from R1 outperform GPT-4o and Claude 3.5 even at very small parameter counts.
DeepSeek-R1-Distill-Qwen-32B with 8-bit quantisation is just a bit too big for 32GB of RAM: the weights alone come to roughly 33GB, before you account for the KV cache and everything else running on the machine.
It is possible to run LLMs that perform at the same level as Claude 3.5 or GPT-4o locally today, though inference will be slower than calling a hosted model.
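For example, one common way to do this is llama-cpp-python with a GGUF quantisation of one of the distilled models. The filename, context size and thread count below are placeholders to adjust for your machine:

```python
# pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="./DeepSeek-R1-Distill-Qwen-14B-Q8_0.gguf",  # placeholder path
    n_ctx=8192,     # context window; a bigger window costs more RAM for the KV cache
    n_threads=8,    # CPU threads to use for inference
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a Python function that reverses a linked list."}],
    max_tokens=1024,
)
print(out["choices"][0]["message"]["content"])
```

A Q8_0 GGUF of the 14B distill is roughly 15-16GB on disk, so it fits on a 32GB machine with room left over for context.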
The larger question is "will proprietary models always be better than open models?", and I argue that since it is always possible to distil proprietary models (as is the case with R1), the answer is no.
Hardware gets cheaper and more efficient, and I believe local LLMs are inevitable. Thank you for coming to my TED talk.