Are any of these local sized models any good at code generation?
This is actually quite a complex question that needs to be broken down a bit.
They're a bit all over the place: for example, Claude 3.5 performs best on SWE-bench Verified with 50.8, but comes far behind DeepSeek R1 on LiveCodeBench (38.9 vs 65.9) and behind o1 on Codeforces (717 vs 2061).
There's an argument that chain of thought (CoT) models do better, because they argue with themselves a bit before giving their answer.
- https://www.ibm.com/think/topics/chain-of-thoughts
- LIMO: Less is More for Reasoning: https://arxiv.org/pdf/2502.03387
Let's say the frontrunners are Claude 3.5, o1, and R1.
The problem with proprietary models is that you can't really know their parameter count unless the vendor publishes it, which they usually don't.
This article suggests (without a source) that Claude 3.5 is a 175B-parameter model. If it's stored in 16-bit floats, you'd need at least 350GB of RAM just to load the weights into memory.
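As a back-of-the-envelope check (taking that unsourced 175B figure at face value, and counting only the weights, not the KV cache or any runtime overhead), the arithmetic looks like this:

```python
def weights_ram_gb(n_params: float, bits_per_param: int) -> float:
    """Rough RAM needed just to hold the weights in memory."""
    return n_params * bits_per_param / 8 / 1e9  # bits -> bytes -> GB

for bits in (16, 8, 4):
    print(f"175B params at {bits}-bit: ~{weights_ram_gb(175e9, bits):.0f} GB")
# 175B params at 16-bit: ~350 GB
# 175B params at 8-bit: ~175 GB
# 175B params at 4-bit: ~88 GB
```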
If you downsample a 16-bit float to an 8-bit or 4-bit representation you may or may not lose information, depending on what the original number was. According to research, compressing a model beyond 4 bits causes a deterioration in performance, but 8-bit quantisation is largely comparable to the uncompressed equivalent (A Comprehensive Evaluation of Quantization Strategies for Large Language Models).
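To make "may or may not lose information" concrete, here's a minimal numpy sketch of symmetric round-to-nearest quantisation. It's a toy per-tensor scheme, not what GPTQ, AWQ or llama.cpp actually do (those use per-channel or group-wise scales, among other tricks):

```python
import numpy as np

def quantize_roundtrip(w: np.ndarray, bits: int) -> np.ndarray:
    """Quantise to a signed integer grid and dequantise back to float."""
    qmax = 2 ** (bits - 1) - 1        # 127 for 8-bit, 7 for 4-bit
    scale = np.abs(w).max() / qmax    # map the largest weight onto qmax
    return np.round(w / scale).clip(-qmax, qmax) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(4096, 4096)).astype(np.float32)  # fake weight matrix

for bits in (8, 4):
    err = np.abs(quantize_roundtrip(w, bits) - w)
    print(f"{bits}-bit round-trip: mean abs error {err.mean():.2e}, max {err.max():.2e}")
```

In this scheme an 8-bit grid has 255 representable levels, so most weights barely move; a 4-bit grid has only 15, which is where the visible degradation starts.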
In another paper, How Does Quantization Affect Multilingual LLMs?, the authors conclude that "challenging tasks degrade fast and severely (e.g. mathematical reasoning and responses to realistic challenging prompts)" and that "damage from quantization is much worse than appears from automatic metrics: even when not observed automatically, human evaluators notice it".
Red Hat compared various quantisations of Llama 3.1 (down to int8; no 4-bit comparisons) and found that performance was comparable to the unquantised models.
I speculate that 8-bit quantisation is probably fine for long sessions, but that with 4-bit quantisation the precision loss will compound and result in poor responses.
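To illustrate that intuition (and nothing more), here's a toy numpy experiment that pushes one activation vector through a stack of random layers and measures how far the quantised-weight paths drift from the full-precision path. It isn't a faithful model of a long chat session, just of small per-layer errors compounding with depth:

```python
import numpy as np

def quantize(w: np.ndarray, bits: int) -> np.ndarray:
    """Same symmetric round-trip quantisation as above."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale).clip(-qmax, qmax) * scale

rng = np.random.default_rng(1)
depth, dim = 40, 512
layers = [rng.normal(0, dim ** -0.5, size=(dim, dim)) for _ in range(depth)]

x = rng.normal(size=dim)
exact = x.copy()
drifted = {8: x.copy(), 4: x.copy()}

for w in layers:
    exact = np.tanh(exact @ w)
    for bits in drifted:
        drifted[bits] = np.tanh(drifted[bits] @ quantize(w, bits))

for bits, y in drifted.items():
    rel = np.linalg.norm(y - exact) / np.linalg.norm(exact)
    print(f"{bits}-bit weights: relative drift after {depth} layers = {rel:.3f}")
```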
So with 8-bit quantisation I'd only need around 175GB of RAM to run Claude 3.5; that's starting to become attainable.
There's also dynamic quantisation, where only the weights are quantised and the intermediate results are kept in full precision; this could help with the precision collapse in long sessions I mentioned.
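PyTorch ships a helper that does exactly this. Here's a minimal sketch on a toy stand-in model; quantising a real LLM for local use would normally go through llama.cpp, GPTQ, AWQ or bitsandbytes instead:

```python
import torch
from torch import nn

# Toy stand-in for a transformer block's linear layers.
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

# Dynamic quantisation: weights are stored as int8, but the activations you
# pass between layers remain ordinary float tensors, quantised on the fly
# only for each matmul.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 1024)
print((model(x) - quantized(x)).abs().max())  # worst-case output difference
```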
In January, DeepSeek famously published DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. The R1 model (whose weights are freely available) is a state-of-the-art model that performs at the same level as Claude 3.5, GPT-4o, and o1-mini.
R1 is a 671B-parameter mixture-of-experts model, but only 37B parameters are activated for any given token.
DeepSeek used a technique called distillation, where a larger model is prompted to produce training data that is then used to fine-tune a smaller model. They produced several smaller models (70B, 32B, 14B, 8B, etc.) using Qwen and Llama as foundation models and R1 as the teacher that generated the distillation set. (It's widely accepted that they distilled o1 to create R1.)
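Conceptually the distillation loop is simple. Here's a heavily simplified sketch, where `call_teacher`, the stub output, and the prompt are placeholders for however you'd actually query the big model; DeepSeek's real pipeline (roughly 800k curated samples) is far more involved:

```python
def build_distillation_set(call_teacher, prompts):
    """Collect (prompt, teacher completion) pairs to fine-tune a student on.

    `call_teacher` stands in for however you query the large model (a hosted
    API, a local inference server, ...); it should return the teacher's full
    reasoning trace plus final answer as one string.
    """
    return [{"prompt": p, "completion": call_teacher(p)} for p in prompts]

# Toy demonstration with a stub teacher standing in for R1.
pairs = build_distillation_set(
    call_teacher=lambda p: "<think>2 and 2 make 4.</think> The answer is 4.",
    prompts=["What is 2 + 2?"],
)
print(pairs)

# Step two is plain supervised fine-tuning of a small base model (Qwen or
# Llama) on pairs like these; no reinforcement learning is needed, and that
# is what produces the R1-Distill family.
```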
The paper itself draws this conclusion about distillation:

> Therefore, we can draw two conclusions: First, distilling more powerful models into smaller ones yields excellent results, whereas smaller models relying on the large-scale RL mentioned in this paper require enormous computational power and may not even achieve the performance of distillation. Second, while distillation strategies are both economical and effective, advancing beyond the boundaries of intelligence may still require more powerful base models and larger-scale reinforcement learning.
Model | AIME 2024 (pass@1) | AIME 2024 (cons@64) | MATH-500 (pass@1) | GPQA Diamond (pass@1) | LiveCodeBench (pass@1) | CodeForces (rating) |
---|---|---|---|---|---|---|
GPT-4o-0513 | 9.3 | 13.4 | 74.6 | 49.9 | 32.9 | 759 |
Claude-3.5-Sonnet-1022 | 16.0 | 26.7 | 78.3 | 65.0 | 38.9 | 717 |
OpenAI-o1-mini | 63.6 | 80.0 | 90.0 | 60.0 | 53.8 | 1820 |
QwQ-32B-Preview | 50.0 | 60.0 | 90.6 | 54.5 | 41.9 | 1316 |
DeepSeek-R1-Distill-Qwen-1.5B | 28.9 | 52.7 | 83.9 | 33.8 | 16.9 | 954 |
DeepSeek-R1-Distill-Qwen-7B | 55.5 | 83.3 | 92.8 | 49.1 | 37.6 | 1189 |
DeepSeek-R1-Distill-Qwen-14B | 69.7 | 80.0 | 93.9 | 59.1 | 53.1 | 1481 |
DeepSeek-R1-Distill-Qwen-32B | 72.6 | 83.3 | 94.3 | 62.1 | 57.2 | 1691 |
DeepSeek-R1-Distill-Llama-8B | 50.4 | 80.0 | 89.1 | 49.0 | 39.6 | 1205 |
DeepSeek-R1-Distill-Llama-70B | 70.0 | 86.7 | 94.5 | 65.2 | 57.5 | 1633 |
In these benchmarks, the Qwen and Llama models distilled from R1 outperform GPT-4o and Claude 3.5 even at very small parameter counts.
DeepSeek-R1-Distill-Qwen-32B with 8-bit quantisation is just a bit too big for 32GB of RAM: the weights alone come to roughly 33GB, before you account for the KV cache and everything else running on the machine.
It is possible to run LLMs that perform at the same level as Claude 3.5 or GPT-4o locally today, though inference will be slower than calling a hosted model.
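For example, one common way to do this is llama-cpp-python with a GGUF quantisation of one of the distilled models. The filename, context size and thread count below are placeholders to adjust for your machine:

```python
# pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama(
    model_path="./DeepSeek-R1-Distill-Qwen-14B-Q8_0.gguf",  # placeholder path
    n_ctx=8192,     # context window; a bigger window costs more RAM for the KV cache
    n_threads=8,    # CPU threads to use for inference
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a Python function that reverses a linked list."}],
    max_tokens=1024,
)
print(out["choices"][0]["message"]["content"])
```

A Q8_0 GGUF of the 14B distill is roughly 15-16GB on disk, so it fits on a 32GB machine with room left over for context.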
The larger question is "will proprietary models always be better than open models?", and I argue that since it is always possible to distil proprietary models (as is the case with R1), the answer is no.
Hardware gets cheaper and more efficient, and I believe local LLMs are inevitable. Thank you for coming to my TED talk.