You are distilling Swift API documentation for use by an LLM code assistant. Your goal is "factually lossless" distillation - read the entire document and preserve all information needed to use the API correctly while removing verbose explanations.

Save the output (distilled docs) next to the input file, with the file extension .distilled.v1.md.

PREFACE WITH:

If information is available:

  • In broad terms, what this library/package/module does
  • Minimum system requirements

KEEP (Essential API Information):

  1. Type Definitions

    • Protocols, structs, classes, enums with their full signatures
    • Generic constraints
    • Property wrappers and their requirements
  2. Method/Function Signatures (see the sketch after this list)

    • Complete signatures with parameter names, types, and default values
    • Return types
    • throws, async, and @MainActor annotations
    • Generic parameters and where clauses
  3. Critical Behavior Notes

    • Non-obvious side effects
    • Thread safety requirements (e.g., "must be called on main thread")
    • Timing constraints (e.g., "call before viewDidLoad")
    • State requirements (e.g., "only valid when authenticated")
    • Important deprecations
  4. Key Relationships

    • Protocol conformances
    • Required associated types
    • Type aliases (when clarifying)
    • Composition patterns (e.g., "use with .forEach operator")
  5. Minimal Context

    • One-line purpose for non-obvious types/methods
    • Disambiguation when names are similar
    • Common gotchas or pitfalls
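
For illustration (hypothetical API; every name below is invented), a distilled entry for these declarations would keep the generic constraints, the default parameter value, and the `throws`/`async`/`@MainActor` annotations verbatim:

```swift
import Foundation

// Hypothetical API, for illustration only: a distilled entry should
// preserve every element shown here exactly as documented.
public protocol SessionStore {
    associatedtype Token: Hashable                // required associated type
    func token(for key: String) -> Token?         // optionality preserved
}

@MainActor
public final class Client<S: SessionStore> where S.Token == UUID {
    // Keep default values, throws/async, and actor isolation intact.
    public func fetch(path: String, retries: Int = 3) async throws -> Data {
        fatalError("illustrative stub")
    }
}
```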

REMOVE (Verbose Content):

  1. Tutorial Content

    • Step-by-step guides
    • "Let's build..." narratives
    • Extended walkthroughs
  2. Code Examples

    • Keep only if the API usage is truly non-obvious from the signature
    • If kept, reduce to minimal working example (3-5 lines max)
  3. Prose Explanations

    • Marketing language ("elegant", "powerful", "ergonomic")
    • Motivational content
    • Historical context ("in version X we...")
    • Philosophy discussions
  4. Redundant Information

    • Obvious parameter descriptions (e.g., for count: Int don't write "An integer representing the count")
    • Standard Swift conventions
    • Repeated explanations across similar APIs
  5. Metadata

    • File paths
    • Directory structures (unless architecturally significant)
    • Documentation source references

FORMAT GUIDELINES:

  • Use compact Swift-like syntax
  • Group related APIs together
  • Use minimal markdown (headers for major sections only)
  • Prefer lists over paragraphs
  • Use symbolic notation where clear (e.g., ? for optional, throws not "this method can throw")

OUTPUT STRUCTURE:

# [Package Name]

## Core Types

### TypeName
- Purpose: [one line]
- Protocol: ProtocolName
- Key properties: prop1: Type, prop2: Type
- Key methods: methodName(param: Type) -> ReturnType

## Operators/Patterns

### Pattern Name
- When: [brief usage context]
- Syntax: reducerName.operator(args)

## Common Gotchas
- [List only critical, non-obvious issues]

DISTILLATION TARGETS:

  • Aim for 5-10x size reduction
  • Prioritize accuracy over brevity
  • When in doubt, keep technical details, remove prose

QUALITY CHECK:

After distillation, verify an LLM could:

  1. Understand what types/methods exist
  2. Know how to call each method correctly
  3. Avoid major pitfalls
  4. Understand key relationships between types

"Compression" vs "Distillation" - Does Terminology Matter?

The Hypothesis

Would renaming the task from "compression" to "distillation" change LLM behavior, particularly around accuracy vs. brevity trade-offs?


Semantic Analysis

"Compression" Implies:

Physical/Technical Metaphor:

  • Make smaller
  • Remove redundancy
  • Optimize for size
  • Lossy vs lossless
  • Efficiency metric: compression ratio (10x, 5x)

Associated Concepts:

  • ZIP files, JPEG compression
  • Information theory
  • Size reduction as primary goal
  • "Compress this as much as possible"

Cognitive Frame:

  • Success = smaller output
  • Trade-off between size and quality
  • "Good compression" = high ratio
  • Emphasis on REMOVAL

Potential LLM Interpretation:

  • "My job is to make this shorter"
  • "I should remove as much as possible"
  • "Being helpful means being concise"
  • Size reduction is the KPI

"Distillation" Implies:

Chemical/Refinement Metaphor:

  • Extract essence
  • Purify
  • Concentrate the important parts
  • Remove impurities while preserving core

Associated Concepts:

  • Chemistry: separating components
  • Whiskey distillation: concentrating flavor
  • Essence extraction
  • Refinement process
  • Quality improvement through purification

Cognitive Frame:

  • Success = purer output
  • Trade-off between purity and completeness
  • "Good distillation" = preserves essential character
  • Emphasis on REFINEMENT

Potential LLM Interpretation:

  • "My job is to extract the essential information"
  • "I should keep what's truly important"
  • "Being helpful means preserving critical details"
  • Purity/essence is the KPI

Predicted Behavioral Differences

What "Compression" Might Encourage:

  1. Aggressive removal - "How can I make this shorter?"
  2. Pattern completion to save space - "Instead of listing all methods, I'll show a pattern"
  3. Inventing summaries - "Rather than incomplete info, I'll summarize what it probably is"
  4. Optimization thinking - "Can I express this in fewer tokens?"
  5. Ratio focus - "I achieved 10x compression!"

Result: More likely to invent/infer in order to achieve a better compression ratio.

What "Distillation" Might Encourage:

  1. Selective preservation - "What's the essence here?"
  2. Quality over quantity - "Better to preserve one thing perfectly than guess at five"
  3. Admitting gaps - "The essence includes knowing what's missing"
  4. Purity thinking - "Is this truly from the source, or am I adding impurities?"
  5. Essence focus - "I preserved the core correctly"

Result: More likely to mark incomplete info and preserve accuracy.


LLM Training Data Patterns

"Compression" in Training Data:

Common contexts:

  • File compression (technical)
  • Image compression (lossy acceptable)
  • Data compression algorithms
  • "Compress this summary to 100 words" (aggressive reduction)
  • Video compression (quality trade-offs)

Pattern learned: Compression involves trade-offs, and some loss is acceptable.

"Distillation" in Training Data:

Common contexts:

  • Academic papers: "We distill the key findings"
  • Chemistry: precise extraction process
  • Knowledge distillation (ML): transferring learned knowledge accurately
  • "Distilled wisdom" (preserving essence)
  • Whiskey/spirits: concentrating without losing character

Pattern learned: Distillation preserves essential qualities while removing non-essentials.


Psychological Framing Effects

Framing Theory Applied to LLMs

If LLMs are trained on human text, they inherit human cognitive biases, including framing effects.

Compression Frame:

  • Activates "efficiency" heuristics
  • Primes "make smaller" goal
  • Suggests size-based metrics
  • Implies some loss is acceptable ("lossy compression")

Distillation Frame:

  • Activates "quality preservation" heuristics
  • Primes "extract essence" goal
  • Suggests purity-based metrics
  • Implies the result should be truer, not just smaller

Real-World Evidence From Our Testing

Current Prompt Uses "Compress/Compression" 81 Times

Count of key terms in the current prompts (reproducible with the sketch below):

  • "compress/compression/compressed" - 81 mentions
  • "accuracy/accurate" - 23 mentions
  • "exact/exactly" - 47 mentions

Ratio: 81 compression mentions vs. 70 accuracy-related mentions (23 + 47)

Observation: Even with strong accuracy instructions, models still invented signatures. Could the overwhelming "compression" framing be overriding accuracy messages?
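
A rough way to reproduce these counts, assuming the prompt lives in a file named prompt-v3.md (the file name and the stems are assumptions; stems are chosen to match all inflections, e.g. "compress" also matches "compression" and "compressed"):

```swift
import Foundation

// Count substring occurrences of each stem in the prompt text.
let text = ((try? String(contentsOfFile: "prompt-v3.md", encoding: .utf8)) ?? "")
    .lowercased()

func occurrences(of stem: String, in text: String) -> Int {
    // Splitting on the stem yields one more piece than there are matches.
    text.components(separatedBy: stem).count - 1
}

for stem in ["compress", "accura", "exact"] {
    print(stem, occurrences(of: stem, in: text))
}
```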


Hypothetical Reframing

Current Opening (V3):

You are compressing Swift API documentation for use by an LLM code assistant. 
Your goal is **conceptually lossless** compression...

Problems:

  1. "Compression" is the verb (action)
  2. "Conceptually lossless" is a modifier fighting against the compression frame
  3. It's asking for lossless compression, but compression implies possible loss

Alternate Opening (Distillation):

You are distilling Swift API documentation for use by an LLM code assistant.
Your goal is to extract and preserve all essential API information...

Advantages:

  1. "Distillation" implies preservation of essence
  2. "Extract and preserve" emphasizes keeping, not removing
  3. "Essential API information" focuses on what matters
  4. No implied trade-off between size and quality

Counter-Arguments

Why Terminology Might NOT Matter:

  1. Explicit instructions dominate: The detailed rules override semantic framing
  2. LLMs are literal: They follow instructions more than implied meanings
  3. Technical context: In API documentation, both terms are well-understood
  4. Training data: Models see both terms in similar contexts often enough to treat them as near-equivalent

Why It Still Might Matter:

  1. Implicit bias accumulation: Small framings add up across the entire prompt
  2. Goal activation: The first sentence sets the mental "mode"
  3. Conflict resolution: When instructions conflict, framing influences which wins
  4. Completion bias: Under uncertainty, framing guides which completion feels "right"

Test Design

To actually measure the effect:

Test Set:

Same GRDB documentation

Condition A (Compression):

Prompt uses "compress/compression" terminology throughout

Condition B (Distillation):

Prompt uses "distill/distillation" terminology throughout

Everything Else:

Identical instructions, same rules, same examples

Metrics:

  1. Number of invented signatures
  2. Use of [NOT IN SOURCE] markers
  3. Syntax errors
  4. Completeness of protocol coverage
  5. Output length (to control for simply making everything longer)
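
Metrics 2 and 5 can be scored mechanically; metric 1 (invented signatures) still needs a manual diff against the source. A minimal scoring sketch, assuming the two condition outputs live in hypothetical files grdb.compressed.md and grdb.distilled.md:

```swift
import Foundation

// Score each condition's output: count of [NOT IN SOURCE] markers and word count.
for path in ["grdb.compressed.md", "grdb.distilled.md"] {
    let output = (try? String(contentsOfFile: path, encoding: .utf8)) ?? ""
    let markers = output.components(separatedBy: "[NOT IN SOURCE]").count - 1
    let words = output.split(whereSeparator: \.isWhitespace).count
    print("\(path): \(markers) [NOT IN SOURCE] markers, \(words) words")
}
```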

Hypothesis:

"Distillation" framing will result in:

  • Fewer invented signatures (primary metric)
  • More [NOT IN SOURCE] markers
  • Similar or slightly longer output length
  • More conservative/accurate documentation

Practical Recommendation

Combined Approach:

Use "distillation" as the primary metaphor, but acknowledge size:

You are distilling Swift API documentation to extract its essential technical content for an LLM code assistant.

**Goal:** Preserve all critical API information with perfect accuracy while removing verbose explanations, marketing language, and tutorials.

**Think of this as:** Distilling whiskey - the result should be more concentrated but must preserve the exact character and qualities of the source. You're removing water, not changing the spirit.

**Not compression:** This is not about achieving a size ratio. A longer, accurate distillation is far better than a shorter, inaccurate compression.

**Success metric:** A developer can trust every signature you document and knows clearly when information is incomplete.

Why this works:

  1. Primary frame: Distillation (quality/purity)
  2. Explicit rejection: "Not compression" directly counters any compression bias
  3. Concrete metaphor: Whiskey distillation is visceral and clear
  4. Success metric: Shifted from "size" to "trust"

Expected Impact

Conservative Estimate:

  • 10-20% reduction in invented signatures
  • Terminology alone won't fix the problem, but might help at the margins

Optimistic Estimate:

  • 30-50% reduction in invented signatures
  • Framing effects on LLMs may be stronger than expected
  • Combined with better instructions, could significantly improve accuracy

Realistic Expectation:

  • Some improvement, but not a silver bullet
  • Worth doing as part of a multi-pronged approach
  • Low cost (just word changes), potential benefit

Other Terminology Options

Alternative Metaphors to Consider:

"Extraction"

  • Pro: Implies taking only what's there
  • Con: Sounds mechanical, less quality focus

"Refinement"

  • Pro: Implies improving quality
  • Con: Might suggest changing/improving the source

"Crystallization"

  • Pro: Implies pure, precise form
  • Con: Too abstract, less common

"Concentration"

  • Pro: Clear chemical process metaphor
  • Con: Less distinct from compression

"Purification"

  • Pro: Strong accuracy/purity connotation
  • Con: Implies source is "impure" (awkward framing)

"Distillation" appears to be the best alternative - it's:

  • Common enough to be understood
  • Scientific/precise in connotation
  • Implies both reduction AND preservation
  • Has positive quality associations

V4 Prompt Opening (Proposed)

# Swift API Documentation Distillation Guide

You are distilling Swift API documentation to extract its essential technical content for an LLM code assistant.

**Distillation Philosophy:** Like distilling spirits, you're removing dilution (verbose explanations, marketing, tutorials) while preserving the exact character and essence (all API signatures, critical behaviors, warnings).

**Iron Law:** The distilled result must be perfectly accurate. Every API signature you include must be exact. Every behavior you document must be verifiable in the source.

**Not Compression:** This is not about minimizing size or achieving a compression ratio. A longer, accurate distillation is infinitely better than a shorter, inaccurate one.

**Success Criteria:**
1. A developer can trust every signature is real and exact
2. All incomplete information is clearly marked
3. Critical warnings and behaviors are preserved
4. No invented or inferred APIs

**Think of yourself as:** A careful chemist extracting pure essence, not a file compression algorithm optimizing for size.

Conclusion

Does terminology matter? Probably yes, to some degree.

Will it solve the problem alone? No - we need all the other improvements too.

Is it worth changing? Yes - low cost, potential benefit, and it makes the goal clearer.

Recommendation: Use "distillation" as the primary metaphor in V4, explicitly contrast it with compression, and use the whiskey distillation metaphor for clarity.

Bottom line: Terminology is one tool among many, but given LLMs' sensitivity to framing and their training on human text with human cognitive biases, there's good reason to think "distillation" would activate more accuracy-preserving behaviors than "compression."
