limitation with local models for code review.md

Yes, this feature is absolutely buildable — the spec, the API surface, and the existing Omni architecture all line up cleanly. Here's my assessment across both the feature itself and AFM compatibility.

Feature Feasibility

The core mechanism is straightforward: a single POST /repos/{owner}/{repo}/pulls/{pull_number}/reviews that atomically submits a summary, a verdict (APPROVE / REQUEST_CHANGES / COMMENT), and an array of inline comments each pinned to a path, line, and side . Omni already has the GitHub OAuth identity and write capabilities in githubWrite.ts, so the API plumbing is an extension of what exists .
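As a sketch, building that call could look like the following TypeScript. The endpoint is the documented one; the helper name and client shape are illustrative, not Omni's actual githubWrite.ts code:

```typescript
// Illustrative sketch: building the request for
// POST /repos/{owner}/{repo}/pulls/{pull_number}/reviews.
// buildReviewRequest is a hypothetical helper, not Omni's real API.

interface ReviewComment {
  path: string;            // file path relative to the repo root
  line: number;            // line number in the diff
  side: "LEFT" | "RIGHT";  // which side of the split diff
  body: string;            // comment text
}

interface ReviewPayload {
  body: string;                                      // overall summary
  event: "APPROVE" | "REQUEST_CHANGES" | "COMMENT";  // the verdict
  comments: ReviewComment[];
}

function buildReviewRequest(
  owner: string,
  repo: string,
  pullNumber: number,
  token: string,
  review: ReviewPayload
): { url: string; method: string; headers: Record<string, string>; body: string } {
  return {
    url: `https://api.github.com/repos/${owner}/${repo}/pulls/${pullNumber}/reviews`,
    method: "POST",
    headers: {
      Accept: "application/vnd.github+json",
      Authorization: `Bearer ${token}`, // the OAuth token Omni already holds
    },
    body: JSON.stringify(review),
  };
}
```

One atomic call creates the summary, verdict, and every inline comment together, which is what makes a draft-then-submit UX possible.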

The UX spec in the issue comments maps directly onto the existing PRView component tree — adding a ReviewList in the sidebar, a ReviewDetailView in the DetailsPanel, and inline annotations on the DiffViewer. The changes needed are well-scoped:

| Component | Change |
|---|---|
| `Sidebar/index.tsx` | Add `<ReviewList />` below `<FileList />` |
| `DetailsPanel` | Add `ReviewDetailView` as a new view option |
| `FileDetailView.tsx` | Pass review comments to `DiffViewer`, render annotations |
| `BottomBar.tsx` | Add "Submit Review" action when a draft exists |
| `usePRStore` / `useUIStore` | Track `selectedReviewId` and `draftReview` state |

New files: ReviewList.tsx, ReviewDetailView.tsx, reviewStore.ts, useReview.ts.

The AI Generation Layer

The blog post describes Omni's 6-source enrichment pipeline (diff, project files, linked issues/PRs, web references, agent notes, style preferences) feeding a model that outputs structured inline comments + summary + verdict. The model needs to produce structured JSON like:

```json
{
  "body": "Overall summary...",
  "event": "REQUEST_CHANGES",
  "comments": [
    { "path": "src/auth.ts", "line": 42, "side": "RIGHT", "body": "..." }
  ]
}
```
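Because a small local model can emit near-valid JSON that still violates this schema, a runtime guard before submission is cheap insurance. A minimal hand-rolled sketch (no validation library assumed):

```typescript
// Minimal structural check for the model's review JSON before it is sent
// to GitHub. Hand-rolled on purpose; a schema library would also work.

function isValidReview(value: unknown): boolean {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const eventOk =
    v.event === "APPROVE" || v.event === "REQUEST_CHANGES" || v.event === "COMMENT";
  if (typeof v.body !== "string" || !eventOk || !Array.isArray(v.comments)) return false;
  // Every inline comment must carry a valid path / line / side / body.
  return (v.comments as unknown[]).every((c) => {
    if (typeof c !== "object" || c === null) return false;
    const k = c as Record<string, unknown>;
    return (
      typeof k.path === "string" &&
      Number.isInteger(k.line) &&
      (k.side === "LEFT" || k.side === "RIGHT") &&
      typeof k.body === "string"
    );
  });
}
```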

This is the part where AFM compatibility matters.

AFM Support Assessment

Apple Foundation Models (the on-device models available via the FoundationModels framework on macOS with Apple Intelligence) can technically handle this, but with important caveats:

What works well:

  • AFM supports structured generation / tool calling, which maps directly to outputting the review JSON schema (path, line, side, body per comment)
  • Running on-device means no round trips to a cloud API and no token costs — great for the "one button, full review" UX
  • In a Tauri app, you can bridge to AFM via a Swift native plugin (Tauri's plugin system → Rust FFI → Swift FoundationModels framework)

What's challenging:

  • Context window size — AFM's on-device models are significantly smaller than cloud models (Claude, GPT-4, etc.). Fitting all 6 context sources (full diff + project files + issues + web refs + agent notes + style prefs) into a single prompt may exceed what AFM can handle for large PRs
  • Reasoning quality — generating accurate line numbers and correctly correlating diff hunks with project-wide context requires strong reasoning. Smaller on-device models may produce less precise line-pinned comments than a cloud model
  • Availability — AFM requires Apple Silicon + macOS with Apple Intelligence enabled, which limits your user base to recent Macs

Recommended approach:

  • Make the model layer swappable — define a ReviewGenerator interface that takes the enriched context and returns the structured review JSON
  • Implement AFM as one backend (via Tauri native Swift plugin), alongside cloud model backends (OpenAI, Anthropic, etc.)
  • For AFM specifically, use a chunked strategy: summarize each of the 6 sources individually first, then feed the compressed context into the final review generation call to stay within the on-device context window
  • Fall back to cloud models for large PRs where the context exceeds AFM's capacity
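A sketch of that swappable layer follows; all names (ReviewGenerator, EnrichedContext, LOCAL_CONTEXT_BUDGET) are illustrative, and the token budget is an assumption since Apple doesn't publish a hard figure for AFM:

```typescript
// Pluggable model layer: one interface, multiple backends, size-based routing.

interface EnrichedContext {
  diff: string;
  projectFiles: string;
  linkedIssues: string;
  webRefs: string;
  agentNotes: string;
  stylePrefs: string;
}

interface GeneratedReview {
  body: string;
  event: "APPROVE" | "REQUEST_CHANGES" | "COMMENT";
  comments: { path: string; line: number; side: "LEFT" | "RIGHT"; body: string }[];
}

interface ReviewGenerator {
  readonly name: string;
  generate(ctx: EnrichedContext): Promise<GeneratedReview>;
}

// Crude token estimate (~4 chars/token is a common heuristic; real tokenizers vary).
function estimateTokens(ctx: EnrichedContext): number {
  return Math.ceil(Object.values(ctx).join("\n").length / 4);
}

const LOCAL_CONTEXT_BUDGET = 4096; // assumed practical budget for the on-device model

// Route: small contexts stay local, large ones fall back to the cloud backend.
function pickBackend(
  ctx: EnrichedContext,
  local: ReviewGenerator,
  cloud: ReviewGenerator
): ReviewGenerator {
  return estimateTokens(ctx) <= LOCAL_CONTEXT_BUDGET ? local : cloud;
}
```

The same interface later absorbs an Ollama backend with no UI changes, which is the payoff of keeping the routing decision out of the generators themselves.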

Verdict

The feature is very doable — the spec is tight, the API is atomic, and it layers cleanly onto the existing PR view. AFM can work as one of several model backends, but it shouldn't be the only backend due to context window and reasoning limitations on large diffs. The right architecture is a pluggable model interface where AFM handles quick reviews on smaller PRs locally, and cloud models handle the heavy lifting when needed.


but can the AFM model reason about code review? say if I make a PR on OAuth and I forgot to add a comment to one code block, would AFM be able to catch something like that?

Honestly, no — AFM almost certainly would not reliably catch something like a missing comment on an OAuth code block. Here's why.

The Model Is Too Small for Code Reasoning

AFM on-device is a ~3B parameter model quantized down to 2 bits, making its effective capacity even smaller. It was designed and fine-tuned for consumer text tasks — summarization, writing tools, content tagging, smart replies — not code understanding or software engineering reasoning. Apple's own documentation describes it as a "text-based on-device model" that "identifies patterns for generating new text". [1][2][3][4][5]

For context on what model sizes can actually do with code:

| Task | What It Takes |
|---|---|
| "Is this comment missing?" | Understand code semantics + project conventions — needs a ~7B+ code-specialized model at minimum |
| "This OAuth flow doesn't handle token expiry" | Reason about security patterns across files — needs a frontier-class model (Claude, GPT-4) |
| Reliable line-pinned inline review comments | Correlate diff hunks with project context — needs strong reasoning + a large context window |

Even purpose-built code models at 7B parameters (CodeLlama-7B, DeepSeek-Coder-6.7B) struggle with nuanced review tasks. A general-purpose 3B model at 2-bit quant is not in the same league. [6]

What AFM Is Actually Good At

The framework excels at structured generation (guided output that conforms to a Swift schema), tool calling, and short-form text tasks. So AFM could reliably format the review JSON structure — it's the reasoning content inside that structure that would be shallow or wrong. [7][8]

A realistic failure mode: AFM might generate a review that looks correct (proper path, line, side fields) but the body comments would be surface-level parroting like "consider adding documentation" rather than actually understanding that your specific OAuth code block implements PKCE and the missing comment should explain the code verifier derivation.

Guardrails Are Also a Problem

Developers have reported that AFM has aggressively restrictive guardrails that can refuse or error on unexpected content categories. Code that touches auth, security, or cryptographic patterns could potentially trigger refusals depending on how the content filters interpret the prompt — which would be a showstopper for reviewing OAuth PRs. [9]

What Would Actually Work

For Omni's review feature to deliver the quality described in your blog post — where the AI "sees how that line connects to the rest of the codebase, whether it contradicts a linked issue" — you need a model that can:

  1. Hold the full context — diff + project files + issues + style prefs often exceeds a 3B model's practical context window
  2. Reason about code semantics — not just pattern-match, but understand what the code does
  3. Generate precise line-pinned feedback — map insights back to exact diff locations

That's firmly in frontier model territory (Claude Sonnet/Opus, GPT-4o, Gemini Pro). AFM could serve as a fast fallback for trivial tasks — like generating a PR summary from the diff — but for the actual code review intelligence, you'd want to route through a cloud model. The pluggable ReviewGenerator interface approach still makes sense, but AFM would be the "lite mode," not the real thing. [10][11][12][13][14][15][16][17][18][19][20][21][22][23][24][25][26][27]


what ollama model would work with this review feature?

For Omni's review feature specifically, Qwen 2.5 Coder 32B is your best bet on Ollama — it's the sweet spot between running locally and having enough reasoning capability to generate meaningful code review comments. [28][29]

Recommended Models by Hardware

| Model | VRAM Needed | Aider Score | Code Review Quality |
|---|---|---|---|
| Qwen 2.5 Coder 32B | ~20GB (Q4) | 72.9% | Good — matches GPT-4o level, can catch missing comments, auth pattern issues |
| Qwen 2.5 Coder 14B | ~9GB (Q4) | 69.2% | Decent — catches obvious issues, may miss nuanced context |
| DeepSeek R1 Distill 32B | ~20GB (Q4) | — | Strong reasoning but slow — better for complex logic bugs |
| Qwen3-Coder | Varies | — | Newer option with a longer context window, good for large diffs |
| Qwen 2.5 Coder 7B | ~5GB (Q4) | 57.9% | Too shallow for real code review — fine for autocomplete only |

The 32B Qwen Coder scores 72.9% on the Aider benchmark, which is on par with GPT-4o. That's the threshold where a model can realistically look at your OAuth diff, notice a code block lacks documentation, and write a meaningful inline comment like "this PKCE verifier derivation should have a comment explaining the flow." [30]

Why 32B Is the Minimum for Reviews

Code review is fundamentally harder than code generation — the model needs to simultaneously hold the diff, understand the surrounding codebase context, reason about what's missing (not just what's wrong), and output structured JSON with precise line numbers. The 7B and even 14B models struggle with this multi-concern reasoning because they hold fewer patterns simultaneously. The gap becomes obvious when you're asking "what did the developer forget?" versus "write me a function." [30]

Practical Setup for Omni

Since you're on macOS (likely M-series), you're in a good position — Apple Silicon shares unified memory between CPU and GPU [30]:

  • M2/M3 Max (32GB+) → Qwen 2.5 Coder 32B runs comfortably at Q4
  • M2/M3 Ultra (64GB+) → Could even run 70B models, though overkill for this feature
  • M2/M3 Pro (18GB) → Limited to 14B, which is borderline for review quality
```sh
ollama pull qwen2.5-coder:32b
```

From your Tauri app, you'd hit Ollama's local API (http://localhost:11434/api/chat) with the enriched context from your 6-source pipeline and a structured output prompt that produces the review JSON. [29]
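A sketch of that request from the frontend — the system prompt wording is illustrative, and `format: "json"` is Ollama's chat-API option for constraining the reply to valid JSON:

```typescript
// Builds the /api/chat request for Ollama. The system prompt is illustrative;
// `format: "json"` asks Ollama to constrain the reply to valid JSON.

function buildOllamaRequest(model: string, enrichedContext: string) {
  return {
    url: "http://localhost:11434/api/chat",
    payload: {
      model,
      stream: false,
      format: "json",
      messages: [
        {
          role: "system",
          content:
            "You are a code reviewer. Reply ONLY with JSON of the shape " +
            '{"body": string, "event": "APPROVE" | "REQUEST_CHANGES" | "COMMENT", ' +
            '"comments": [{"path": string, "line": number, "side": "LEFT" | "RIGHT", "body": string}]}.',
        },
        { role: "user", content: enrichedContext },
      ],
    },
  };
}

// Usage sketch:
//   const { url, payload } = buildOllamaRequest("qwen2.5-coder:32b", ctx);
//   const res = await fetch(url, { method: "POST", body: JSON.stringify(payload) });
//   const review = JSON.parse((await res.json()).message.content);
```

Even with `format: "json"`, smaller models occasionally drift from the schema, so validating the parsed object before submitting to GitHub is still worthwhile.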

The Honest Trade-off

Even the 32B model will produce shallower reviews than Claude Sonnet or GPT-4o for complex PRs — it may catch the missing comment but miss that your OAuth flow has a subtle CSRF vulnerability in the state parameter handling. The hybrid approach makes the most sense: Ollama 32B as the default local engine for fast, private, cost-free reviews, with a cloud model fallback for when the user wants a deeper analysis. Your ReviewGenerator interface can expose this as a simple toggle — "Quick Review (local)" vs "Deep Review (cloud)." [30][31][32][33][34][35][36][37][38][39][40][41][42]


what about for MacBook m1 16gb

That's a tight squeeze. On an M1 with 16GB unified memory, macOS itself eats ~4–6GB, leaving you roughly 10–12GB for the model + context — which seriously limits your options. [43][44]

What Actually Fits

| Model | Size (Q4) | Fits in 10–12GB? | Code Review Quality |
|---|---|---|---|
| Qwen 2.5 Coder 7B | ~4.5GB | ✅ Comfortable, room for context | Surface-level — catches obvious stuff |
| DeepSeek Coder V2 Lite (16B) | ~9GB | ⚠️ Tight, minimal context window | Better reasoning, but may swap |
| Qwen 2.5 Coder 14B | ~9GB | ⚠️ Tight, reported laggy on M1 16GB | Decent, but real-world performance suffers |
| Gemma 3 12B | ~7.5GB | ✅ Fits with headroom | General-purpose, not code-specialized |
| Qwen 2.5 Coder 32B | ~20GB | ❌ Won't fit | — |

The fundamental problem is that even if a 14B model loads, you'll be memory-starved for context. Multiple users on 16GB Macs report that anything above ~12B leads to heavy swap usage, which on M1 means token generation drops to single-digit tokens/sec — painfully slow for generating a multi-comment review. [45][44]

The Realistic Pick: Qwen 2.5 Coder 7B

```sh
ollama pull qwen2.5-coder:7b
```

This runs at roughly 19–35 tokens/sec on M1 16GB with room to spare for your enrichment context. The catch is that it scores ~57.9% on the Aider benchmark — which means for your OAuth PR scenario [46][43]:

  • ✅ Would probably catch a completely missing docstring/comment on a function
  • ✅ Can identify obvious issues like unused variables, missing error handling
  • ⚠️ Might miss why a specific code block needs a comment (e.g., "this implements PKCE, explain the verifier")
  • ❌ Unlikely to catch subtle auth logic flaws or cross-file concerns

A More Honest Take for Omni

For an M1 16GB user, the local-only code review would produce noticeably weaker results than the cloud path. The practical approach for your ReviewGenerator interface would be:

  • Local (7B) → marketed as "Quick Review" — fast, private, catches low-hanging fruit, good enough for formatting/docs/obvious issues
  • Cloud (Claude/GPT-4o) → marketed as "Deep Review" — the full 6-source enrichment pipeline with real code reasoning

You could also do a hybrid: run the 7B locally to generate a draft review skeleton (identify which files/lines are interesting), then send just those targeted chunks to a cloud model for the actual comment text. This keeps the cloud token cost low while producing quality output even on constrained hardware.
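The slicing step of that hybrid can be very small. A sketch, where the helper name, the `Target` shape, and the context radius are all illustrative:

```typescript
// Hybrid two-pass helper: the local model only emits (path, line) targets;
// this slices a small window around each target so the cloud model sees
// just the relevant code instead of the whole diff.

interface Target {
  path: string;
  line: number; // 1-based line number flagged by the local model
}

function sliceTargets(
  fileLines: Map<string, string[]>, // path -> file contents split into lines
  targets: Target[],
  radius = 5 // lines of context on each side of the target
): { path: string; line: number; snippet: string }[] {
  return targets.flatMap((t) => {
    const lines = fileLines.get(t.path);
    if (!lines) return []; // local model hallucinated a path; drop it
    const start = Math.max(0, t.line - 1 - radius);
    const end = Math.min(lines.length, t.line + radius);
    return [{ path: t.path, line: t.line, snippet: lines.slice(start, end).join("\n") }];
  });
}
```

Dropping targets whose paths don't exist also doubles as a cheap sanity filter on the weaker local model's output.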

If Users Upgrade Later

It's worth noting the M1 16GB is now a 5-year-old machine. For users on M3/M4 Pro with 36GB+, the 32B model becomes viable and the local experience gets dramatically better. Building the pluggable architecture now means those users automatically get the upgrade path without any feature changes on your side. [45][47][48][49][50][51][52][53][54][55][56][57]

Footnotes

  1. https://www.apple.com/newsroom/2025/09/apples-foundation-models-framework-unlocks-new-intelligent-app-experiences/

  2. https://machinelearning.apple.com/research/apple-foundation-models-2025-updates

  3. https://developer.apple.com/documentation/FoundationModels

  4. https://datawizz.ai/blog/apple-foundations-models-framework-10-best-practices-for-developing-ai-apps

  5. https://www.reddit.com/r/LocalLLaMA/comments/1l7l39m/apples_on_device_foundation_models_llm_is_3b/

  6. https://checkthat.ai/answers/best-llm-for-coding

  7. https://developer.apple.com/videos/play/wwdc2025/286/

  8. https://developer.apple.com/videos/play/wwdc2025/301/

  9. https://www.reddit.com/r/LocalLLaMA/comments/1leq843/the_bizarre_limitations_of_apples_foundation/

  10. https://machinelearning.apple.com/research/apple-foundation-models-tech-report-2025

  11. https://www.infoq.com/news/2025/07/apple-foundation-models-ios26/

  12. https://www.infoq.com/news/2024/08/apple-foundation-models/

  13. https://developer.apple.com/documentation/FoundationModels/generating-content-and-performing-tasks-with-foundation-models

  14. https://developer.apple.com/videos/play/wwdc2025/259/

  15. https://machinelearning.apple.com/research/introducing-apple-foundation-models

  16. https://www.createwithswift.com/exploring-the-foundation-models-framework/

  17. https://www.ibm.com/think/news/wwdc-2025-live

  18. https://learnprompting.org/docs/models/apple_foundation_models

  19. https://www.reddit.com/r/LocalLLaMA/comments/1lak9yb/findings_from_apples_new_foundationmodel_api_and/

  20. https://blog.trailofbits.com/2024/06/14/understanding-apples-on-device-and-server-foundations-model-release/

  21. https://arxiv.org/html/2407.21075

  22. https://lobehub.com/blog/claude-3-vs-gpt-4

  23. https://arxiv.org/pdf/2507.13575.pdf

  24. https://venturebeat.com/ai/ios-gets-an-ai-upgrade-inside-apples-new-intelligence-system

  25. https://www.kolosal.ai/blog-detail/top-5-best-llm-models-to-run-locally-in-cpu-2025-edition

  26. https://justaithings.org/apple-introduces-foundation-models-framework-to-let-you-run-a-3b-model-on-your-phone-89194d00cc18

  27. https://www.deeplearning.ai/the-batch/issue-306/

  28. https://whatllm.org/blog/best-open-source-models-january-2026

  29. https://www.codegpt.co/blog/best-ollama-model-for-coding

  30. https://failingfast.io/local-coding-ai-models/

  31. https://huggingface.co/blog/daya-shankar/open-source-llms

  32. https://deepgram.com/learn/best-local-coding-llm

  33. https://www.reddit.com/r/ollama/comments/1hwb5us/which_coding_model_do_you_prefer_using_with/

  34. https://www.reddit.com/r/LocalLLaMA/comments/1j2usb0/is_qwen_25_coder_still_the_best/

  35. https://localaimaster.com/models/best-ai-coding-models

  36. https://launchdarkly.com/docs/tutorials/ollama-javascript

  37. https://www.reddit.com/r/LocalLLaMA/comments/1q82ae8/start_of_2026_whats_the_best_open_coding_model/

  38. https://dev.to/lightningdev123/top-5-local-llm-tools-and-models-in-2026-1ch5

  39. https://www.youtube.com/watch?v=y-P85ww2RHU

  40. https://composio.dev/blog/qwen-3-vs-deepseek-r1-complete-comparison

  41. https://www.youtube.com/watch?v=e5bSX1iNvH0

  42. https://pinggy.io/blog/top_5_local_llm_tools_and_models/

  43. https://localaimaster.com/blog/mac-local-ai-setup

  44. https://www.reddit.com/r/ollama/comments/1mjo9ki/best_models_under_16gb/

  45. https://www.zdnet.com/article/i-tried-local-ai-on-m1-mac-brutal-experience/

  46. https://www.youtube.com/watch?v=A8bOjO-boDU

  47. https://localaimaster.com/blog/run-llama3-on-mac

  48. https://www.linkedin.com/pulse/benchmarking-local-ollama-llms-apple-m4-pro-vs-rtx-3060-dmitry-markov-6vlce

  49. https://ollama.com/library/qwen2.5-coder:14b

  50. https://www.youtube.com/watch?v=ribc9QXV_-w

  51. https://www.youtube.com/watch?v=hrcUVlqimTo

  52. https://dev.to/atsushiambo/running-qwen-nearly-as-powerful-as-deepseek-on-a-macbook-pro-367k

  53. https://www.reddit.com/r/LocalLLM/comments/1qjjwtu/as_of_january_2026_what_the_best_coding_model/

  54. https://apxml.com/posts/best-local-llms-apple-silicon-mac

  55. https://www.codegpt.co/blog/choosing-best-ollama-model

  56. https://dev.to/ocodista/deepseek-r1-7bs-performance-on-a-developers-macbook-3mg2

  57. https://www.reddit.com/r/LocalLLaMA/comments/1lj69o2/qwen25_14b_on_an_m1_pro_with_16gb_ram/
