Beyond the Black Box: Why Clean Fine-Tuning Rotates LLM Vulnerabilities Instead of Curing Them
In the rapidly evolving world of mechanistic interpretability, a central debate has emerged: Are language model capabilities localized to a single, privileged computational pathway, or are they densely distributed across a compositional landscape of competing mechanisms?
Two fascinating pieces of work from May 2026 bring this theoretical debate into sharp, high-stakes focus—especially concerning model safety and clinical reliability.
The first is the newly published paper, "All Circuits Lead to Rome: Rethinking Functional Anisotropy in Circuit and Sheaf Discovery for LLMs" (Chen et al., 2026). The second is an empirical investigation of my own into what I'm calling the authority-flip rotation: what happens to a known safety vulnerability when you apply clean supervised fine-tuning.
When viewed together, these two deliver a unified warning: When we perform downstream fine-tuning, we aren't always curing vulnerabilities; we might just be rotating them across the input distribution.
The Core Theoretical Shift: The Distributive Dense Circuit Hypothesis
For years, a dominant, implicit assumption in the circuit discovery literature has been the Functional Anisotropy Hypothesis—the idea that a specific model function can be localized to a unique or near-unique internal subnetwork.
Chen et al. (2026) challenge this assumption. By introducing Overlap-Aware Sheaf Repulsion (OASR), a method that explicitly penalizes structural overlap across discovery runs, they argue on both empirical and theoretical grounds that a single language model task can be supported by multiple, structurally distinct computational subgraphs (sheaves) that are simultaneously faithful, sparse, and complete.
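To make "penalizing structural overlap" concrete, here is a minimal sketch of one way an overlap term between edge sets from different discovery runs could be computed. It is not the OASR objective from the paper: the Jaccard measure, the function names, and the weight `lam` are all assumptions chosen for illustration.

```python
import numpy as np

def jaccard_overlap(mask_a, mask_b):
    """Jaccard overlap between two binary edge masks (1 = edge kept)."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0  # two empty circuits share nothing
    return float(np.logical_and(a, b).sum() / union)

def repulsion_penalty(candidate, previous_masks):
    """Mean overlap between a candidate mask and the masks from earlier runs."""
    if len(previous_masks) == 0:
        return 0.0
    return float(np.mean([jaccard_overlap(candidate, p) for p in previous_masks]))

def repelled_objective(task_loss, candidate, previous_masks, lam=1.0):
    """Hypothetical objective: task loss plus a weighted overlap penalty."""
    return task_loss + lam * repulsion_penalty(candidate, previous_masks)

# Toy edge masks over six edges (synthetic, for illustration only).
run_1 = [1, 1, 0, 0, 0, 0]
run_2_similar = [1, 1, 1, 0, 0, 0]   # reuses both edges from run_1
run_2_distinct = [0, 0, 0, 1, 1, 0]  # shares no edges with run_1

print(repelled_objective(0.5, run_2_similar, [run_1], lam=0.3))
print(repelled_objective(0.5, run_2_distinct, [run_1], lam=0.3))
```

With equal task loss, the candidate that reuses edges from the earlier run scores worse, which is the pressure that pushes later runs toward structurally different subgraphs.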
They formalize this as the Distributive Dense Circuit Hypothesis: under high-dimensional superposition, distinct edge subsets can produce nearly identical readout vectors via subset-sum collisions. In short: There is no single "canonical core" mechanism. The model has a dense, compositional regime of competing, partially redundant mechanisms that can reach nearly the same output.
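The subset-sum intuition can be illustrated with a toy experiment. The sketch below is not the paper's construction: random vectors stand in for per-edge contributions to a readout, the dimensions and norms are arbitrary, and the greedy selection rule is my own choice. It splits the "edges" into two disjoint pools and builds, from each pool separately, a subset whose sum approximates the same target readout.

```python
import numpy as np

def greedy_subset(pool, target):
    """Greedily pick rows of `pool` whose sum approximates `target`.

    Returns (indices of chosen rows, achieved sum).
    """
    chosen = []
    total = np.zeros_like(target)
    available = np.ones(len(pool), dtype=bool)
    while available.any():
        residual = target - total
        # Residual norm that would result from adding each remaining row.
        new_norms = np.linalg.norm(residual - pool, axis=1)
        new_norms[~available] = np.inf
        best = int(np.argmin(new_norms))
        if new_norms[best] >= np.linalg.norm(residual):
            break  # no remaining row improves the approximation
        chosen.append(best)
        available[best] = False
        total = total + pool[best]
    return np.array(chosen, dtype=int), total

rng = np.random.default_rng(0)
d, n = 16, 400                       # many more "edges" than dimensions
edges = rng.normal(size=(n, d))
edges *= 0.3 / np.linalg.norm(edges, axis=1, keepdims=True)  # norm 0.3 each
target = rng.normal(size=d)
target /= np.linalg.norm(target)     # unit-norm target readout

# Two disjoint pools of edges, searched independently.
idx_a, idx_b = np.arange(0, n // 2), np.arange(n // 2, n)
chosen_a, sum_a = greedy_subset(edges[idx_a], target)
chosen_b, sum_b = greedy_subset(edges[idx_b], target)
subset_a = set(idx_a[chosen_a].tolist())
subset_b = set(idx_b[chosen_b].tolist())

cosine = float(sum_a @ sum_b / (np.linalg.norm(sum_a) * np.linalg.norm(sum_b)))
print("shared edges:", len(subset_a & subset_b))
print("cosine between the two readouts:", round(cosine, 3))
```

The two subsets share no edges by construction, yet their summed readouts point in similar directions. The effect depends on having far more candidate contributions than dimensions, which is the regime that superposition describes.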
The Practical/Safety Implication: The Authority-Flip Rotation
While Chen et al. map the abstract geometry of this phenomenon on standard benchmarks, I wanted to probe what this means in a high-stakes domain—medical multiple-choice questions (MedMCQA).
The vulnerability I investigated is the "authority flip"—where instruction-tuned models abandon a correct clinical answer when a fabricated "clinical-guideline update" is prepended to the prompt, even when that fake update lacks any real evidentiary value.
When the model undergoes clean supervised fine-tuning (SFT) on off-domain data with zero adversarial signals, an aggregate look at the metrics suggests mild success: the overall flip rate drops by 4.6 percentage points.
But when you stratify by the model's baseline confidence quartiles, a striking geometric reality emerges:
- At Low Confidence (Q1): The flip rate drops by 20.3pp, a strong protective effect.
- At High Confidence (Q4): The flip rate rises by 10.1pp, so fine-tuning acts as an iatrogenic amplifier.
Clean fine-tuning did not erase the vulnerability. Instead, it rotated the authority-flip response along the confidence axis, sharpening the representation and redistributing the flaw across the input distribution: questions the base model was unsure about became more robust, while questions it was confident about became easier to flip. This is consistent with the Distributive Dense Circuit Hypothesis. One plausible reading is that gradient descent shifted which computational circuits are active underneath the task, hiding structural flaws behind an aggregate "bookkeeping win."
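The stratified comparison is easy to reproduce on any evaluation set. The sketch below bins items by baseline confidence quartile and reports the change in flip rate per bin alongside the aggregate. The function name and the toy arrays are illustrative assumptions, not the experiment's actual code or data; the toy values are synthetic and chosen only to show opposite-signed shifts hiding inside a modest aggregate change.

```python
import numpy as np

def flip_rate_delta_by_quartile(confidence, flipped_base, flipped_sft):
    """Change in flip rate (percentage points) per baseline-confidence quartile.

    confidence:   baseline model confidence per item
    flipped_base: 1 if the base model flipped on that item, else 0
    flipped_sft:  1 if the fine-tuned model flipped on that item, else 0
    """
    confidence = np.asarray(confidence, dtype=float)
    flipped_base = np.asarray(flipped_base, dtype=float)
    flipped_sft = np.asarray(flipped_sft, dtype=float)
    # Interior quartile edges (25th, 50th, 75th percentiles).
    edges = np.quantile(confidence, [0.25, 0.5, 0.75])
    # Bin index 0..3 corresponds to Q1..Q4.
    # Heavy ties in confidence can leave a bin empty, which yields NaN.
    bins = np.digitize(confidence, edges, right=True)
    deltas = {}
    for q in range(4):
        mask = bins == q
        deltas[f"Q{q + 1}"] = 100.0 * (flipped_sft[mask].mean() - flipped_base[mask].mean())
    deltas["all"] = 100.0 * (flipped_sft.mean() - flipped_base.mean())
    return deltas

# Synthetic toy data: eight items, two per quartile.
toy_confidence = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
toy_base = [1, 1, 1, 0, 0, 1, 0, 0]
toy_sft = [0, 0, 1, 0, 0, 1, 1, 0]

deltas = flip_rate_delta_by_quartile(toy_confidence, toy_base, toy_sft)
for name, value in deltas.items():
    print(f"{name}: {value:+.1f}pp")
```

In this toy case the aggregate improves while Q4 gets worse, which is the same qualitative pattern as the rotation described above.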
Localization is the Only True Defense
If a task can be solved by an array of low-overlap mechanisms, can we ever reliably defend a model via mechanistic pruning?
The answer appears to be yes, but only if we target the primary, load-bearing pathway before fine-tuning pushes the model into a more distributed equilibrium. By ranking attention heads based on their projection onto the confidence direction, I identified a set of "compliance heads" to ablate.
The results validate the causal importance of this localized structure:
- Ablating these specific compliance heads before SFT defended the model across every confidence band (+15.0pp protection over the full pool).
- Conversely, ablating an identical number of heads from an orthogonal residual probe failed completely, yielding a mildly anti-defense trend (-4.2pp).
This suggests that even in a world where "all circuits lead to Rome," certain primary, concentrated substrates carry a large share of the causal load during standard model operation. Intervening on these specific localized structures is what separates a robust defense from a superficial data-shuffling fix.
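The head-ranking step and its control can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact procedure used in the experiment: it assumes each head is summarized by a single vector in the residual stream (for example, its mean output over a prompt set), and the names `rank_heads_by_projection`, `orthogonal_control`, and `ablate` are hypothetical. The control draws a random direction orthogonal to the confidence direction and selects the same number of heads against it.

```python
import numpy as np

def rank_heads_by_projection(head_outputs, direction, k):
    """Return the k (layer, head) pairs with the largest |projection| onto `direction`.

    head_outputs: array of shape (n_layers, n_heads, d_model), one summary
    vector per head (an assumed input format).
    """
    head_outputs = np.asarray(head_outputs, dtype=float)
    direction = np.asarray(direction, dtype=float)
    unit = direction / np.linalg.norm(direction)
    scores = np.abs(head_outputs @ unit)              # (n_layers, n_heads)
    order = np.argsort(scores, axis=None)[::-1][:k]   # flat indices, best first
    return [tuple(int(i) for i in np.unravel_index(j, scores.shape)) for j in order]

def orthogonal_control(direction, rng):
    """Random unit vector orthogonal to `direction` (one Gram-Schmidt step)."""
    direction = np.asarray(direction, dtype=float)
    unit = direction / np.linalg.norm(direction)
    v = rng.normal(size=unit.shape)
    v = v - (v @ unit) * unit
    return v / np.linalg.norm(v)

def ablate(head_outputs, heads):
    """Zero-ablate the listed (layer, head) pairs and return a copy."""
    out = np.array(head_outputs, dtype=float, copy=True)
    for layer, head in heads:
        out[layer, head] = 0.0
    return out

# Synthetic toy model: 2 layers x 2 heads, d_model = 3.
toy_outputs = np.array([[[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]],
                        [[-3.0, 0.0, 0.0], [0.5, 0.5, 0.0]]])
confidence_direction = np.array([2.0, 0.0, 0.0])

top_heads = rank_heads_by_projection(toy_outputs, confidence_direction, k=2)
control_direction = orthogonal_control(confidence_direction, np.random.default_rng(0))
control_heads = rank_heads_by_projection(toy_outputs, control_direction, k=2)
ablated = ablate(toy_outputs, top_heads)
print("targeted heads:", top_heads)
print("control heads:", control_heads)
```

Matching the number of ablated heads between the targeted and control conditions keeps the comparison about which heads are removed rather than how many.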
The Takeaway for the Community
Evaluating AI safety and capability through aggregate deltas alone can be deeply misleading. A model can appear to "average out" a flaw while actually concentrating it in a specific, highly confident band of its input space.
As we move into an era dominated by distributed dense computation, mechanistic interpretability must transition away from searching for a single canonical circuit. Instead, we must focus on characterizing families of functionally equivalent mechanisms and mapping how downstream optimization routes information between them.
References
- Chen, X., Jin, M., Niu, J., Yin, Y., Zhao, J., Guo, B., Metaxas, D. N., Wang, Z., Yue, Y., & Penn, G. (2026). All Circuits Lead to Rome: Rethinking Functional Anisotropy in Circuit and Sheaf Discovery for LLMs. Proceedings of the 43rd International Conference on Machine Learning.