Ruminations on Theory and Motivations
- The Concept of Enshittification: Coined by Cory Doctorow, the term describes a pattern in which platforms first offer great value to attract users, then lock those users in, and finally extract value by degrading the service, shifting the surplus to business customers (advertisers, etc.) or, in this case, to the platform owner itself through cost reduction.
- Applying it to Frontier AI Chatbots:
- Phase 1: Attract Users: Release a groundbreaking model (e.g., initial GPT-4, Claude 3 Opus). Offer free access or affordable subscriptions. Generate massive hype and positive press. Users are amazed by the capabilities (complex reasoning, creativity, coding).
- Phase 2: Lock-in Users: Users integrate the tool into their daily workflows, studies, or creative processes. They become accustomed to its abilities and interface. Subscription models create a direct financial lock-in. Network effects are less direct than social media but exist in terms of shared knowledge, prompting techniques, and ecosystem tools built around the popular models.
- Phase 3: Extract Value / Degrade Service (The Hypothesis): This is where the prompt's core idea lies. After achieving scale and user dependency, the provider faces immense operational costs (GPU inference time is expensive!). To improve margins or simply manage costs under heavy load, they might subtly degrade the user experience while continuing to present the product under the same brand name. The hypothesized degradation manifests as reduced capability aimed at saving compute resources.
- Motivations for Degradation (Cost Saving):
- Inference Costs: Running large frontier models is extremely expensive. Every query consumes significant GPU resources. Reducing this cost per query, even slightly, translates to massive savings at scale.
- Scaling Challenges: As user numbers explode, maintaining the initial high-quality performance for everyone becomes computationally infeasible or requires astronomical investment. Load balancing might involve routing users to less capable endpoints.
- Profitability Pressure: Venture capital and market expectations demand a path to profitability. Cutting operational expenditure (OpEx) is a primary lever.
- Tier Differentiation: Intentionally making the free or cheaper tiers slightly less capable can incentivize upgrades to premium tiers (which might also be subtly degraded from the original peak, but still better than the lower tiers).
- Plausible Deniability: AI model outputs are inherently stochastic. Performance can vary. Users might complain, but proving systematic degradation without transparency or API access is hard. The company can attribute perceived changes to model updates (claiming improvements), prompt variations, or simply the nature of AI. The "web chat interface" limitation makes this easier to obscure than API changes where users might have more tools for analysis.
- How Degradation Might Manifest (Compute-Saving Techniques):
- Model Swapping/Routing: Silently routing some queries (especially from free users or during peak load) to smaller, faster, cheaper-to-run models under the same product name (a hypothetical sketch follows this list).
- Aggressive Quantization/Pruning: Using versions of the model that have been more heavily optimized for speed/size, potentially sacrificing nuance, reasoning depth, or accuracy.
- Reduced Inference Parameters: Using faster but potentially lower-quality decoding settings (e.g., narrower beam width or plain greedy sampling in place of more expensive search or best-of-n selection, or tighter caps on output length).
- Shorter Effective Context: Even if the stated context window is large, the model might be implicitly configured to pay less attention to earlier parts of the conversation to save processing.
- Reduced Search/Tool Use: Limiting the frequency, depth, or computational budget allocated to integrated tools like web browsing or code execution, making the model seem less capable or knowledgeable about recent information.
- Stricter/Simpler Guardrails: Implementing computationally cheaper safety filters that might be overly broad, leading to more refusals or generic, safe responses instead of nuanced ones.
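The routing scenario in particular is easy to picture. Below is a purely hypothetical sketch; the backend names, load thresholds, and the `current_gpu_load()` helper are all invented for illustration and do not describe any real provider's infrastructure.

```python
# Hypothetical illustration only: how silent, load-based model routing could work
# server-side. All names and thresholds are made up.

def current_gpu_load() -> float:
    """Placeholder for a cluster-utilization metric in [0, 1]."""
    return 0.87  # pretend the fleet is busy right now


def pick_backend(user_tier: str) -> str:
    """Choose which model actually serves the request."""
    load = current_gpu_load()
    if user_tier == "free" and load > 0.8:
        return "distilled-small"      # cheaper, less capable model for free users
    if load > 0.95:
        return "quantized-medium"     # degraded even for paying users at peak
    return "frontier-full"            # the model users believe they always get


print(pick_backend("free"))   # -> "distilled-small" under the pretend load above
```

From the outside, every response would still carry the same product name, which is exactly why the experiment below has to infer the change from output quality alone.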
Experiment Design: The Longitudinal Web Chat Benchmark
Objective: To detect statistically significant changes (specifically, degradation) in the capabilities of a closed-weight frontier AI model accessed via its web chat interface over time, consistent with potential cost-saving measures.
Hypothesis: Post-launch and initial hype phase, the effective quality and capability of responses from Model X via its web interface will decrease on specific compute-intensive tasks, potentially correlating with user tier (free vs. paid) or time of day/week (peak load).
Methodology:
- Model & Interface Selection:
- Choose 1-3 specific frontier models accessible via web chat (e.g., GPT-4 via ChatGPT, Claude 3 Opus via claude.ai, Gemini Advanced).
- Note the specific interface URL and any version indicators available.
- Account Setup:
- Create multiple accounts for each service:
- At least 2 Free tier accounts.
- At least 2 Paid tier accounts (if applicable).
- This helps control for account-specific variations and test tier differences. Use separate browser profiles/containers for each account to avoid cross-contamination.
- Benchmark Prompt Suite Development:
- Create a fixed set of diverse prompts designed to test capabilities potentially sensitive to compute reduction. This suite must remain constant throughout the experiment (a minimal storage sketch follows this list).
- Categories:
- Complex Reasoning: Logic puzzles, multi-step math problems, causal reasoning questions. (Sensitive to model depth/size).
- Long Context Recall/Synthesis: Provide a long text (~75% of claimed context window) and ask detailed questions requiring synthesis of information from different parts. (Sensitive to effective context handling).
- Creative Constraint Following: Generate a story/poem/code snippet adhering to multiple, specific, potentially conflicting constraints (e.g., specific keywords, rhyme scheme, variable naming conventions, specific plot points). (Sensitive to instruction following fidelity and processing depth).
- Nuance and Subtlety: Ask questions requiring understanding of subtle implications, irony, or complex emotions. (Sensitive to model size/quantization).
- Coding Tasks: Generate moderately complex code (e.g., a class structure, an algorithm implementation) or debug provided buggy code. (Sensitive to reasoning and state tracking).
- Knowledge Integration (if web-enabled): Ask questions requiring synthesis of recent information potentially needing web searches. (Sensitive to tool use budget).
- Refusal Boundaries: Carefully crafted prompts near the edge of safety guidelines (but not violating ToS) to see if refusal rates or refusal vagueness increase.
- Prompt Design: Ensure prompts are unambiguous and self-contained. Avoid prompts relying on rapidly changing external world knowledge unless specifically testing web search.
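One way to keep the suite genuinely frozen is to store it as versioned data and fingerprint it at T=0. A minimal sketch, assuming Python for the tooling; the field names, IDs, and example prompts are placeholders, not a prescribed schema.

```python
import hashlib
import json

# Each entry is one frozen benchmark prompt; the fields are illustrative.
PROMPT_SUITE = [
    {
        "id": "reasoning-003",
        "category": "complex_reasoning",
        "prompt": "Three friends split a bill of ... (full text fixed at T=0)",
        "scoring": "exact_answer",          # how the automated check works, if any
        "expected": "42.50",
    },
    {
        "id": "constraint-007",
        "category": "creative_constraints",
        "prompt": "Write a 14-line sonnet that mentions 'gradient', 'harbor', and 'copper'.",
        "scoring": "keyword_and_length",
        "required_keywords": ["gradient", "harbor", "copper"],
        "required_lines": 14,
    },
]

# Hash the serialized suite so any accidental edit after T=0 is detectable.
suite_hash = hashlib.sha256(
    json.dumps(PROMPT_SUITE, sort_keys=True).encode()
).hexdigest()
print(f"Suite fingerprint: {suite_hash[:16]}")
```

Publishing the fingerprint alongside results also lets outside readers verify that the same prompts were used at every epoch.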
- Execution Protocol:
- Baseline Establishment: Run the entire prompt suite across all accounts as soon as possible after the model's launch or the start of the experiment. Record all responses verbatim. This is T=0.
- Regular Testing: Repeat the entire process at fixed intervals (e.g., weekly or bi-weekly).
- Consistency:
- Run tests at roughly the same time of day/week, or deliberately vary times to test peak/off-peak differences (document this!).
- Use the same browser, OS, and geographic location (use VPN if needed for consistency).
- Start each prompt in a new, clean chat session to avoid context carryover effects between benchmark prompts.
- Record everything: account used (free/paid ID), date/time, prompt, full response, any observed latency (even a subjective "felt slow"), any UI changes, and the model version identifier if shown. Take screenshots (a minimal logging sketch follows this list).
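Because the ToS concerns discussed later favor manual execution, the tooling can be limited to record-keeping: each response is pasted into a structured log by hand. A minimal sketch with an illustrative, not prescriptive, schema:

```python
import csv
import datetime
import pathlib

LOG_PATH = pathlib.Path("benchmark_runs.csv")
FIELDS = [
    "run_id", "timestamp_utc", "service", "account_tier", "account_id",
    "prompt_id", "response_text", "latency_note", "ui_notes", "screenshot_file",
]


def log_run(row: dict) -> None:
    """Append one manually executed benchmark interaction to the CSV log."""
    new_file = not LOG_PATH.exists()
    with LOG_PATH.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow(row)


log_run({
    "run_id": "2024-wk07-a",
    "timestamp_utc": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    "service": "claude.ai",
    "account_tier": "free",
    "account_id": "free-01",
    "prompt_id": "constraint-007",
    "response_text": "pasted verbatim from the web chat",
    "latency_note": "felt slow, roughly 20s to first token",
    "ui_notes": "no visible version string",
    "screenshot_file": "screens/2024-wk07-a_constraint-007.png",
})
```

A spreadsheet works just as well; the point is that every field above is captured for every run, every time.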
- Evaluation Metrics:
- Human Evaluation (Primary):
- Recruit multiple (3+) independent evaluators blinded to the date/epoch of the responses.
- Use pairwise comparison: for a given prompt, show the evaluator the response from T=0 and the response from T=N, randomizing which appears as A or B so raters cannot infer recency. Ask: "Which response is better in terms of [accuracy/creativity/completeness/nuance/constraint adherence]?" or "Rate Response A and Response B on a scale of 1-5 for [specific quality]."
- Calculate agreement scores (e.g., Fleiss' Kappa) to ensure rater consistency.
- Track the percentage of times the baseline response is preferred over time.
- Automated Metrics (Secondary/Supporting):
- Correctness: For math/logic/coding, automatically check the answer.
- Constraint Adherence: Use scripts to check for keywords, formatting, and length limits in creative tasks (see the sketch after this list).
- Code Execution: Run generated code and check output/errors.
- Response Length: Track average response length (potential proxy for compute/effort).
- Refusal Rate: Track % of prompts refused.
- API Use (if applicable for comparison): If the same model version is claimed on API and web, run a subset of prompts via API (if possible) as a potential control, noting any differences.
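The machine-checkable metrics lend themselves to one small shared checker run over the logged responses. A minimal sketch, assuming Python; the refusal markers and constraint fields are illustrative and would need tuning per prompt and per model.

```python
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable", "i won't be able")


def check_constraints(response: str,
                      required_keywords: list[str],
                      required_lines: int | None = None) -> dict:
    """Score one logged response against fixed, machine-checkable constraints."""
    text = response.lower()
    keyword_hits = sum(kw.lower() in text for kw in required_keywords)
    lines = [ln for ln in response.splitlines() if ln.strip()]
    return {
        "keyword_fraction": keyword_hits / max(len(required_keywords), 1),
        "line_count_ok": required_lines is None or len(lines) == required_lines,
        "refused": any(marker in text for marker in REFUSAL_MARKERS),
        "length_chars": len(response),  # crude proxy for effort/compute
    }


# Example: a 14-line response that hits two of the three required keywords.
print(check_constraints("A sonnet about a copper harbor...\n" * 14,
                        ["gradient", "harbor", "copper"], required_lines=14))
```

Tracking these scores per prompt ID and per account tier over time gives the raw series that the analysis step below works from.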
- Data Analysis:
- Time Series Plots: Plot the human preference scores (e.g., % preferring baseline) and automated metrics over time for each account type and prompt category.
- Statistical Tests: Use appropriate tests (e.g., paired t-tests or the Wilcoxon signed-rank test for pairwise comparisons; ANOVA or Kruskal-Wallis for comparing tiers/time points) to determine whether observed changes are statistically significant (a worked sketch follows this list).
- Qualitative Analysis: Analyze rater comments and specific examples where performance degraded to understand the nature of the change (e.g., "became more generic," "missed constraints," "failed logic").
- Correlation: Look for correlations between performance changes and external factors like company earnings calls, announcements of cost-saving initiatives, major user growth milestones, or widespread anecdotal reports of quality decline.
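For the paired baseline-vs-epoch comparisons, the Wilcoxon signed-rank test is a natural fit because rating differences are unlikely to be normally distributed. A minimal sketch using scipy; the numbers are made up purely to show the shape of the analysis, not real results.

```python
from scipy.stats import wilcoxon

# Paired per-prompt quality ratings (1-5), averaged across blinded raters.
# Illustrative values only.
baseline_scores = [4.3, 4.7, 3.9, 4.5, 4.1, 4.8, 4.0, 4.6, 4.2, 4.4]
week12_scores   = [4.1, 4.2, 3.8, 4.0, 3.9, 4.5, 3.7, 4.3, 4.0, 4.1]

# One-sided test: are baseline responses rated reliably higher than week-12 ones?
stat, p_value = wilcoxon(baseline_scores, week12_scores, alternative="greater")
print(f"Wilcoxon statistic={stat:.1f}, one-sided p={p_value:.4f}")
```

A small p-value across many prompts and raters, repeated per tier and per category (and corrected for multiple comparisons), is the kind of evidence the hypothesis calls for; a single noisy comparison is not.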
Challenges and Limitations:
- Black Box Nature: We cannot directly observe the model or parameters being used.
- Stochasticity: Need many trials/raters to overcome random variations in output.
- Confounding Variables: Genuine model updates could improve some aspects while degrading others. UI changes can affect interaction. The provider might be A/B testing different versions simultaneously.
- Subjectivity: Human evaluation is essential but introduces subjectivity. Rigorous blinding and multiple raters are crucial.
- Resource Intensive: This is a time-consuming and potentially costly experiment to run consistently over months, especially the human evaluation part.
- Terms of Service: Automated interaction with web interfaces might violate ToS. Manual execution is safer but slower.
Conclusion:
This longitudinal benchmark study, focusing on compute-sensitive tasks and using blinded human evaluation via the web interface, provides a structured approach to testing the hypothesis of AI enshittification driven by cost-saving. While definitive proof is difficult due to the black-box nature of these systems, statistically significant downward trends in capability, especially on tasks known to be computationally demanding, and potentially differing between user tiers, would provide strong evidence supporting the theory. The key is rigorous methodology, consistent execution over time, and careful analysis combining quantitative and qualitative data.