The commit splitting logic uses a fixed batch size of 3 files when autoSplit is enabled. In src/stores/commit/commitStore.ts, line ~340:
```ts
const batchSize = autoSplit ? 3 : Math.max(1, selected.length);
const totalCommits = Math.max(1, Math.ceil(selected.length / batchSize));
```

The actual grouping happens in the Rust backend via three Tauri commands:
- `commit_generate_plan` (Apple AFM)
- `ollama_commit_generate_plan` (Ollama local models)
- `remote_commit_generate_plan` (BYO API keys)
- Files are batched in groups of 3 (hardcoded)
- Each file is analyzed using the Diffsense diff algorithm to distribute context fairly
- No semantic understanding of which files belong together logically
@eonist's suggestion in the follow-up comment outlines a smarter algorithm:
- Dossier creation - Generate lightweight metadata for each file change optimized for grouping decisions
- AI-powered grouping - Use heuristics + AI to determine logical commit boundaries based on:
  - Related functionality (same feature/module)
  - Dependency relationships
  - Change type (refactor vs feature vs fix)
- User intent respect - Honor custom prompt instructions for grouping preferences
- Final review pass - AI validates the groupings make semantic sense
| Aspect | Current | Proposed |
|---|---|---|
| Batch size | Fixed (3 files) | Dynamic, context-aware |
| Grouping logic | File count heuristic | Semantic relationship analysis |
| Scalability | O(n/3) commits | Recursive algorithm for 10-1000 files |
| Model support | Same across all | Works on AFM, improves on larger models |
The bottleneck is inference time, not capability. Even with Apple's AFM (smallest local model), the algorithm could be recursive - taking 6-10 seconds for large refactors is acceptable when quality improves. Users wanting faster results could use larger models via BYO API keys or faster hardware.
- Create a `FileDossier` struct in Rust with: path, change type, affected symbols/imports, diff summary
- Add a grouping phase before message generation that uses AI to cluster related dossiers
- Make the batch size configurable (remove the hardcoded `3`)
- Add unit/integration tests - as both @eonist and @dernDren161 mentioned, regressions are becoming hard to catch
This is an interesting problem because it sits at the intersection of code understanding, user intent, and practical performance constraints. Here are some ideas beyond what's been discussed:
Before involving AI at all, build a dependency graph from static analysis:
- Parse imports/requires to find file relationships
- Detect shared symbols (functions, classes, types referenced across files)
- Use community detection algorithms (like Louvain) to find natural clusters
This is fast, deterministic, and gives AI a head start. The AI then only needs to validate or refine clusters rather than discover them from scratch.
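A rough sketch of what that pre-AI clustering could look like - the regex import extraction and the union-find grouping here stand in for a real AST parser and a proper community-detection pass:

```ts
// Sketch: cluster changed files by shared import edges - no AI involved.
// `changedFiles` maps each changed path to its full file contents.
function extractImports(source: string): string[] {
  // Rough regex for ES-style imports; a real implementation would use an AST parser.
  const re = /import\s+(?:[^'"]*\s+from\s+)?['"]([^'"]+)['"]/g;
  return [...source.matchAll(re)].map((m) => m[1]);
}

function clusterByImports(changedFiles: Map<string, string>): string[][] {
  const paths = [...changedFiles.keys()];
  const parent = new Map(paths.map((p) => [p, p] as [string, string]));
  const find = (p: string): string => {
    while (parent.get(p) !== p) p = parent.get(p)!;
    return p;
  };
  const union = (a: string, b: string) => parent.set(find(a), find(b));

  for (const [path, source] of changedFiles) {
    for (const spec of extractImports(source)) {
      const bare = spec.replace(/^\.\//, '').replace(/^@\//, '');
      // Naive resolution: link to any other changed file whose path contains the specifier.
      const target = paths.find((p) => p !== path && p.includes(bare));
      if (target) union(path, target);
    }
  }

  const clusters = new Map<string, string[]>();
  for (const p of paths) {
    const root = find(p);
    clusters.set(root, [...(clusters.get(root) ?? []), p]);
  }
  return [...clusters.values()];
}
```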
Instead of full LLM inference for grouping:
- Generate lightweight embeddings for each diff (could use AFM's embedding mode or a small local model)
- Cluster diffs by vector similarity
- Use the LLM only for naming/describing the clusters
This could be 10x faster than having the LLM reason about all files simultaneously.
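A hedged sketch of that idea, assuming some `embedDiff` function backed by a local embedding model (not something the app has today) and a greedy similarity threshold instead of a full clustering algorithm:

```ts
// Sketch: group diffs by embedding similarity. `embedDiff` is an assumed
// embedding backend; the threshold would need tuning in practice.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

async function clusterBySimilarity(
  diffs: { path: string; diff: string }[],
  embedDiff: (diff: string) => Promise<number[]>, // assumed embedding backend
  threshold = 0.8
): Promise<string[][]> {
  const vectors = await Promise.all(diffs.map((d) => embedDiff(d.diff)));
  const clusters: { centroid: number[]; paths: string[] }[] = [];

  diffs.forEach((d, i) => {
    // Greedy assignment: join the first cluster whose centroid is similar enough.
    const hit = clusters.find((c) => cosine(c.centroid, vectors[i]) >= threshold);
    if (hit) hit.paths.push(d.path);
    else clusters.push({ centroid: vectors[i], paths: [d.path] });
  });

  return clusters.map((c) => c.paths);
}
```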
For large changesets (50+ files), use a divide-and-conquer strategy:
```
100 files → group by directory → 8 directory clusters
Each cluster → AI refines into logical commits
Result: ~15-20 well-organized commits
```
This keeps context windows manageable and scales predictably.
Before semantic grouping, categorize changes by type:
| Category | Signals |
|---|---|
| Refactor | Renames, moves, no new exports |
| Feature | New files, new public APIs |
| Fix | Small modifications, test additions |
| Chore | Config files, dependencies, docs |
Then group within categories. A "refactor" commit shouldn't mix with a "feature" commit even if they touch related files.
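As a sketch, the signals in the table above could be turned into a cheap pre-classifier; the field names here are illustrative, not the app's actual types:

```ts
// Sketch: rough change-type heuristic from the signal table above.
type ChangeCategory = 'refactor' | 'feature' | 'fix' | 'chore';

interface ChangeSignals {
  path: string;
  isNewFile: boolean;
  isRename: boolean;
  addedExports: number;
  linesChanged: number;
}

function categorize(c: ChangeSignals): ChangeCategory {
  // Config, docs, and dependency files → chore
  if (/\.(json|ya?ml|toml|lock|md)$/.test(c.path)) return 'chore';
  // New files or new public APIs → feature
  if (c.isNewFile || c.addedExports > 0) return 'feature';
  // Renames/moves without new exports → refactor
  if (c.isRename) return 'refactor';
  // Small modifications default to fix; ambiguous cases go to the AI pass
  return 'fix';
}
```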
A simple but high-value heuristic: always group test files with their implementation. If UserService.ts changed, UserService.test.ts belongs in the same commit. This is purely path-based and catches a common mistake.
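A minimal sketch of that pairing rule, assuming the usual `*.test.ts` / `*.spec.ts` naming conventions:

```ts
// Sketch: purely path-based test pairing, e.g. UserService.test.ts → UserService.ts.
function implementationFor(testPath: string): string | null {
  const m = testPath.match(/^(.*)\.(test|spec)\.(ts|tsx|js|jsx)$/);
  return m ? `${m[1]}.${m[3]}` : null;
}

function pairTestsWithImplementations(changedPaths: string[]): Map<string, string> {
  const pairs = new Map<string, string>(); // test path → implementation path
  const changed = new Set(changedPaths);
  for (const p of changedPaths) {
    const impl = implementationFor(p);
    if (impl && changed.has(impl)) pairs.set(p, impl);
  }
  return pairs;
}
```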
```
Input: 47 file changes

Phase 1: Static Analysis (instant, no AI)
├── Directory clustering
├── Import graph analysis
├── Test file pairing
└── Output: 8 candidate clusters

Phase 2: AI Refinement (fast, small prompts)
├── For each cluster: "Should this be 1 or 2 commits?"
├── Cross-cluster: "Do any files belong elsewhere?"
└── Output: 12 refined commits

Phase 3: Message Generation (current approach)
└── Generate title/body per commit
```
The key insight is that grouping and describing are separate problems. The current implementation conflates them. Splitting these phases means:
- Phase 1 handles 80% of cases with zero AI cost
- Phase 2 uses small, focused prompts (faster, works on AFM)
- Phase 3 stays as-is
- Drag-and-drop regrouping in the UI, then AI regenerates only affected commit messages
- Grouping presets: "by directory", "by feature", "atomic (1 file = 1 commit)"
- Learn from corrections: if user moves a file between commits, remember that pattern
If I were implementing this:
- Short term: Remove the hardcoded `3`, make it `ceil(files.length / target_commits)` where `target_commits` is user-configurable (see the sketch after this list)
- Medium term: Add test-file pairing heuristic + directory-based pre-clustering
- Long term: Embedding-based similarity with AI refinement
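For the short-term item, a sketch of the change in commitStore.ts, assuming a user setting tentatively named `targetCommits`:

```ts
// Sketch: derive batch size from a user-configurable target commit count
// instead of the hardcoded 3. `settings.targetCommits` is an assumed field.
const targetCommits = settings.targetCommits ?? 5; // hypothetical default
const batchSize = autoSplit
  ? Math.max(1, Math.ceil(selected.length / targetCommits))
  : Math.max(1, selected.length);
const totalCommits = Math.max(1, Math.ceil(selected.length / batchSize));
```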
The recursive/dossier approach @eonist suggested is good for quality, but the hybrid approach might give 90% of the benefit at 20% of the inference cost. Worth prototyping both and comparing.
The file changes give two pieces of info: the diff that actually changed, and of course access to the entire file. How do you factor that in? Or doesn't it change anything?
That changes things significantly. The full file context is much richer than the diff alone.
| Signal | From Diff | From Full File |
|---|---|---|
| What changed | ✓ Direct | ✗ |
| Change magnitude | ✓ Lines added/removed | ✗ |
| Function/class modified | ~ Partial (if name in diff) | ✓ Full structure |
| File's role in codebase | ✗ | ✓ Imports, exports, patterns |
| Related files | ✗ | ✓ Import statements |
| Domain/feature area | ~ From path/variable names | ✓ Full context |
You're essentially doing "blind" grouping - you see what changed but not why it matters or how files relate. Two files might both add 10 lines, but without full context you can't tell one is a React component and the other is its API hook.
You can extract:
```ts
// From UserProfile.tsx full file:
import { useUserData } from './hooks/useUserData';
import { Avatar } from '@/components/Avatar';
import type { User } from '@/types/user';
// Now you KNOW this file relates to useUserData.ts, Avatar.tsx, user.ts
```

This import graph is deterministic and free - no AI needed.
For each changed file, parse the full file to get:
```ts
interface FileDossier {
  path: string;
  diff: string;            // What changed
  imports: string[];       // What it depends on
  exports: string[];       // What it exposes
  fileType: 'component' | 'hook' | 'util' | 'test' | 'config' | 'type';
  detectedDomain: string;  // From path: "user", "auth", "checkout"
}
```

From the dossiers, build a relationship graph:

```
UserProfile.tsx ──imports──→ useUserData.ts
UserProfile.tsx ──imports──→ Avatar.tsx
UserProfile.test.tsx ──tests──→ UserProfile.tsx
```
If UserProfile.tsx and useUserData.ts both changed → strong signal they belong together.
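A small sketch of how those edges could be derived from the dossiers; `resolveImport` is an assumed helper that maps an import specifier back to a repo path:

```ts
// Sketch: flag "both changed + import relationship" pairs from the dossiers above.
function relatedPairs(
  dossiers: FileDossier[],
  resolveImport: (from: string, spec: string) => string | null // assumed helper
): Array<[string, string]> {
  const changed = new Set(dossiers.map((d) => d.path));
  const pairs: Array<[string, string]> = [];
  for (const d of dossiers) {
    for (const spec of d.imports) {
      const target = resolveImport(d.path, spec);
      // Strong grouping signal: the importer and the imported file both changed.
      if (target && changed.has(target)) pairs.push([d.path, target]);
    }
  }
  return pairs;
}
```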
Use AI for ambiguous cases:
- Two files both changed, no import relationship, but both touch "user" domain
- Large file with multiple unrelated changes (could be split into 2 commits)
- Determining if a change is "refactor", "feature", or "fix"
Diff = what changed. Full file = why it matters.
For grouping, "why it matters" is often more valuable. A one-line change to an API endpoint file might be the cause of 15 other file changes (updating consumers). The diff alone makes it look minor; the full file reveals it's the central piece.
You can't send 47 full files to the LLM - that'll blow context limits. But you can:
- Send full files through a local parser (AST parsing, regex for imports) - instant, unlimited
- Build the relationship graph from that
- Send only diffs + dossier summaries to the LLM
```
File: src/hooks/useUserData.ts
Imports: react, @/api/userApi, @/types/user
Exports: useUserData (hook)
Domain: user
Diff: +12 -3 (added caching logic)

File: src/components/UserProfile.tsx
Imports: react, ./hooks/useUserData, @/components/Avatar
Exports: UserProfile (component)
Domain: user
Diff: +5 -2 (added loading state)

Relationship: UserProfile imports useUserData (BOTH CHANGED)
```
This compressed representation lets the AI make smart grouping decisions without seeing 2000 lines of code.
Our smallest model, AFM, only has a 12KB token window. Is all this possible even with tiny context windows? We can always call AFM many times with fresh context windows, so the strategy must be to create temporary summaries that feed into the next call, etc. - divide and conquer.
Exactly. 12KB is roughly 3000 tokens - that's tight. But the constraint actually forces a cleaner architecture.
Never ask the LLM to see everything at once. Ask it many small questions and aggregate.
With 12KB you can fit roughly:
| Content | Tokens |
|---|---|
| System prompt | ~300 |
| 1 file dossier (compressed) | ~100-150 |
| 1 diff (typical) | ~200-500 |
| Response | ~200-300 |
So realistically: 10-15 files per call maximum, and that's without full diffs.
Input (tiny):

```
File: src/hooks/useUserData.ts
Imports: react, @/api/userApi
Exports: useUserData
Diff: [50 lines]

Classify: refactor | feature | fix | chore
Domain: [one word]
Summary: [one sentence]
```

Output:

```json
{ "type": "feature", "domain": "user", "summary": "Added caching to user data hook" }
```

This runs in parallel - 47 files = 47 calls, but they're fast and concurrent.
Now you have 47 tiny summaries (~50 tokens each). Feed batches to the LLM:
```
These files changed. Which belong together?

1. [feature][user] Added caching to user data hook
2. [feature][user] Added loading state to profile component
3. [fix][auth] Fixed token refresh race condition
4. [feature][user] Added cache invalidation to user API
5. [chore][config] Updated eslint rules
...

Group by logical commit. Output: [[1,2,4], [3], [5], ...]
```
With compressed summaries, you can fit 40-50 files in one call.
For each group:
```
Files in this commit:
- useUserData.ts: Added caching to user data hook
- UserProfile.tsx: Added loading state to profile component
- userApi.ts: Added cache invalidation

Write commit title and body.
```
Small input, focused output.
```
47 files
   │
   ▼ (parallel, 47 calls)
┌─────────────────────┐
│ Pass 1: Classify    │  "What is this change?"
│ 1 file per call     │
└─────────────────────┘
   │
   ▼ (47 summaries, ~2KB total)
┌─────────────────────┐
│ Pass 2: Cluster     │  "What belongs together?"
│ All summaries       │
└─────────────────────┘
   │
   ▼ (say 8 groups)
┌─────────────────────┐
│ Pass 3: Describe    │  "Write the commit message"
│ 1 group per call    │
└─────────────────────┘
   │
   ▼
8 commits with messages
```
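Sketched as code, the whole pipeline is just three awaited stages; the three callbacks stand in for however each backend (AFM/Ollama/remote) exposes its calls - the names here are assumptions, not the existing API:

```ts
// Sketch of the three-pass pipeline built on top of classifyAll() above.
async function smartSplit(
  files: FileDossier[],
  classify: (f: FileDossier) => Promise<{ type: string; domain: string; summary: string }>,
  cluster: (summaries: string[]) => Promise<number[][]>,
  describe: (summaries: string[]) => Promise<{ title: string; body: string }>
) {
  // Pass 1: one tiny call per file, run concurrently
  const classified = await classifyAll(files, classify);

  // Pass 2: one call over all compressed one-line summaries → groups of file indices
  const groups = await cluster(classified.map((c) => `[${c.type}][${c.domain}] ${c.summary}`));

  // Pass 3: one small call per group → commit title + body
  return Promise.all(
    groups.map(async (idxs) => ({
      files: idxs.map((i) => files[i].path),
      ...(await describe(idxs.map((i) => classified[i].summary))),
    }))
  );
}
```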
When summaries don't fit in one Pass 2 call:
```
100 files
   │
   ▼
100 summaries
   │
   ├─► Batch 1 (50 summaries) → 5 proto-groups
   └─► Batch 2 (50 summaries) → 5 proto-groups
          │
          ▼
     10 proto-groups
          │
          ▼ (merge pass)
   "Should any of these merge?"
          │
          ▼
     7 final groups
```
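A sketch of that recursive merge, where `clusterBatch` stands in for a single Pass 2 LLM call over at most one context window's worth of summaries:

```ts
// Sketch: recursive Pass 2 when the summaries exceed one context window.
// `clusterBatch` is an assumed single LLM call that groups the strings it is given.
async function clusterRecursive(
  summaries: string[],
  clusterBatch: (items: string[]) => Promise<string[][]>,
  maxPerCall = 50
): Promise<string[][]> {
  if (summaries.length <= maxPerCall) return clusterBatch(summaries);

  // Split into batches and cluster each batch into proto-groups.
  const batches: string[][] = [];
  for (let i = 0; i < summaries.length; i += maxPerCall) {
    batches.push(summaries.slice(i, i + maxPerCall));
  }
  const protoGroups = (await Promise.all(batches.map(clusterBatch))).flat();

  // Merge pass: represent each proto-group as one line and cluster those lines.
  const groupSummaries = protoGroups.map((g) => g.join('; '));
  const merged = await clusterBatch(groupSummaries);

  // Expand merged groups back into the original summaries they contain.
  return merged.map((g) => g.flatMap((s) => protoGroups[groupSummaries.indexOf(s)]));
}
```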
Don't waste tokens on things you can compute:
| Task | Method |
|---|---|
| Import extraction | AST parse or regex |
| Test file pairing | Path matching (*.test.ts → *.ts) |
| Directory grouping | String operations |
| File type detection | Extension + path patterns |
| Diff stats | Already have from git |
The LLM only answers questions that require judgment:
- Is this a feature or a fix?
- Do these two unrelated-looking files actually belong together?
- What's a good commit message?
| Phase | 47 files | 100 files |
|---|---|---|
| Local parsing | <100ms | <200ms |
| Pass 1 (parallel) | ~2s | ~3s |
| Pass 2 | ~1s | ~2s (2 batches) |
| Pass 3 | ~2s (8 commits) | ~3s (12 commits) |
| Total | ~5s | ~8s |
Assuming AFM runs at ~500 tokens/sec locally. Totally acceptable for a "smart split" operation.
12KB isn't a limitation - it's a feature. It forces you to:
- Decompose the problem properly
- Keep each step focused and debuggable
- Parallelize naturally
- Cache intermediate results (summaries can be reused if user tweaks grouping)
Bigger models let you be lazy. Small models force good architecture.
4K tokens is only a couple of thousand words. That's genuinely tight. Let me rethink.
| Content | Tokens |
|---|---|
| System prompt | ~150 (must be minimal) |
| Input | ~250-300 |
| Response | ~100-150 |
| Usable per call | ~300 tokens of actual content |
That's roughly one file with diff per call, or 20-25 compressed summaries without diffs.
```
src/hooks/useUserData.ts
+import cache from './cache'
+const cached = cache.get(userId)
+if (cached) return cached
-return fetchUser(userId)
+const data = fetchUser(userId)
+cache.set(userId, data)
+return data

Type? Domain? 10-word summary.
```

Response: `feature | user | Added caching layer to user data fetching`
Still works at 4K. One file per call.
25 summaries max per call. For 47 files:
```
Call 1: files 1-25  → proto-groups A
Call 2: files 26-47 → proto-groups B
Call 3: merge A + B → final groups
```
```
Commit: useUserData.ts, UserProfile.tsx, userApi.ts
- Added caching layer to user data fetching
- Added loading state for cache delays
- Added cache invalidation endpoint

Title + body.
```
~150 tokens in, ~100 out. Works.
| Approach | 12K | 4K |
|---|---|---|
| Files per classify call | 3-5 | 1 |
| Summaries per cluster call | 50 | 25 |
| Merge passes needed (100 files) | 1 | 2-3 |
| System prompt | Detailed | Bare minimum |
At 4K, every token counts. Use structured formats:
Instead of:

```
Please analyze this file change and tell me what type of change it is...
```

Use:

```
[src/hooks/useUserData.ts]
+cache.get +cache.set -fetchUser direct
?type ?domain ?sum
```

Response:

```
feat user cache-user-fetch
```
Brutal, but you can fit 3-4 files per call this way.
| Phase | 47 files | 100 files |
|---|---|---|
| Pass 1 | ~3s (parallel) | ~5s |
| Pass 2 | ~2s (2 cluster + 1 merge) | ~4s (4 cluster + 2 merge) |
| Pass 3 | ~2s | ~3s |
| Total | ~7s | ~12s |
More calls, but still fast because each call is tiny and AFM responds quickly to short prompts.
If even this is too slow, make Pass 1 optional:
- Default: Use only local heuristics (imports, paths, test pairing) for grouping
- "Smart mode": Enable AI classification when user wants better results
Most commits are <10 files anyway. The 4K limit only really hurts on large refactors, and users will tolerate 10-15 seconds for those.
And this must also work on bigger models. We offer Ollama support as well, and BYO keys with Gemini 3, Anthropic Opus 4.5, GPT 5.2, etc.
That's the key constraint then. The architecture must scale up gracefully, not just survive at the bottom.
| Tier | Models | Context | Strategy |
|---|---|---|---|
| Tiny | AFM | 4K | Many micro-calls |
| Medium | Ollama (Llama, Mistral) | 8K-32K | Batched calls |
| Large | Gemini 3, Opus 4.5, GPT 5.2 | 128K-1M+ | Few or single call |
```ts
interface ModelCapabilities {
  contextWindow: number;
  tokensPerSecond: number;
  costPerToken?: number; // undefined for local models
}

function planStrategy(files: FileDossier[], model: ModelCapabilities) {
  const totalTokensNeeded = estimateTokens(files);
  if (totalTokensNeeded < model.contextWindow * 0.7) {
    return 'single-pass';       // Send everything at once
  } else if (model.contextWindow >= 32000) {
    return 'chunked-parallel';  // Few large batches
  } else {
    return 'micro-pipeline';    // Many tiny calls (AFM path)
  }
}
```

When you have 128K+ tokens, just send everything:
```
Here are 47 file changes with their diffs and full import context.
Group them into logical commits. For each commit, provide:
- Files included
- Commit title
- Commit body

Consider: related functionality, test/implementation pairing,
change type (feature/fix/refactor), user instructions: "{customPrompt}"

[Full dossiers + diffs for all 47 files]
```
One call. Done in 3-5 seconds. Higher quality because the model sees all relationships simultaneously.
For 16-32K context:
Pass 1: Classify + Cluster combined
- Send 15-20 files per call
- Ask for both classification AND preliminary grouping
- 3 calls for 47 files
Pass 2: Merge + Describe
- Reconcile groups across chunks
- Generate messages
- 2-3 calls
The aggressive chunking strategy from before. Many calls, tiny payloads.
```ts
interface SplitPlan {
  commits: Array<{
    files: string[];
    title: string;
    body: string;
  }>;
}

// Single interface, multiple implementations
async function generateSplitPlan(
  files: FileChange[],
  model: ModelBackend,
  customPrompt?: string
): Promise<SplitPlan> {
  const capabilities = await model.getCapabilities();
  const strategy = planStrategy(files, capabilities);

  switch (strategy) {
    case 'single-pass':
      return singlePassSplit(files, model, customPrompt);
    case 'chunked-parallel':
      return chunkedSplit(files, model, customPrompt);
    case 'micro-pipeline':
      return microPipelineSplit(files, model, customPrompt);
  }
}
```

The UI doesn't care which path runs. Same input, same output, different execution.
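`ModelBackend` isn't spelled out above; a hedged guess at the minimal surface it would need for this dispatch to work:

```ts
// Hypothetical ModelBackend shape implied by generateSplitPlan above;
// the real interface in the codebase may differ.
interface ModelBackend {
  getCapabilities(): Promise<ModelCapabilities>;
  // One prompt in, one completion out; each strategy builds its own prompts on top.
  complete(prompt: string, opts?: { maxTokens?: number }): Promise<string>;
}

// Usage sketch: same call site whether the backend is AFM, Ollama, or a BYO-key cloud model.
const plan = await generateSplitPlan(selectedFiles, activeModel, customPrompt);
```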
| Model | Time (47 files) | Quality | Cost |
|---|---|---|---|
| AFM | ~7s | Good | Free |
| Ollama 7B | ~5s | Good | Free |
| Ollama 70B | ~8s | Better | Free |
| GPT 5.2 | ~3s | Best | ~$0.02 |
| Opus 4.5 | ~4s | Best | ~$0.05 |
Users self-select. Local-first folks use AFM/Ollama. Power users with API keys get faster + better results.
Same logical prompt, different verbosity:
const PROMPTS = {
classify: {
tiny: '[{path}]\n{diff_summary}\n?type ?domain ?sum',
medium: 'File: {path}\nChanges: {diff}\nClassify type, domain, summary.',
large: 'Analyze this file change:\nPath: {path}\nFull diff:\n{diff}\n\nProvide classification...'
},
cluster: {
tiny: 'Group:\n{summaries}\nOutput: [[ids],...]',
medium: 'Group these changes into logical commits:\n{summaries}',
large: 'You are organizing code changes into atomic commits. Consider semantic relationships, shared functionality, test coverage...\n{full_context}'
}
};
function getPrompt(task: string, tier: 'tiny' | 'medium' | 'large') {
return PROMPTS[task][tier];
}Build for AFM first. That forces clean decomposition. Then the larger models just... work better automatically because:
- They can see more context per call → better relationship detection
- Fewer calls → lower latency
- More reasoning tokens → better edge case handling
The micro-pipeline isn't a compromise - it's the foundation. Large models just collapse multiple passes into one.