Beyond the Black Box: Why Clean Fine-Tuning Rotates LLM Vulnerabilities Instead of Curing Them

In the rapidly evolving world of mechanistic interpretability, a central debate has emerged: Are language model capabilities localized to a single, privileged computational pathway, or are they densely distributed across a compositional landscape of competing mechanisms?

Two fascinating pieces of work from May 2026 bring this theoretical debate into sharp, high-stakes focus—especially concerning model safety and clinical reliability.

The first is the newly published paper, "All Circuits Lead to Rome: Rethinking Functional Anisotropy in Circuit and Sheaf Discovery for LLMs" (Chen et al., 2026). The second is my own empirical investigation of what I'm calling the authority-flip rotation: what happens to a known safety vulnerability when you apply clean supervised fine-tuning.

When viewed together, these two deliver a unified warning: When we perform downstream fine-tuning, we aren't always curing vulnerabilities; we might just be rotating them across the input distribution.


  1. The Core Theoretical Shift: The Distributive Dense Circuit Hypothesis

For years, a dominant, implicit assumption in the circuit discovery literature has been the Functional Anisotropy Hypothesis—the idea that a specific model function can be localized to a unique or near-unique internal subnetwork.

Chen et al. (2026) forcefully challenge this assumption. By introducing Overlap-Aware Sheaf Repulsion (OASR), a method that explicitly penalizes structural overlap across discovery runs, they show empirically and argue theoretically that a single language model task can be supported by multiple, structurally distinct computational subgraphs (sheaves) that are simultaneously faithful, sparse, and complete.
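
To make "structural overlap across discovery runs" concrete, here is a minimal sketch that scores a candidate circuit against previously discovered ones using Jaccard similarity over edge sets. This is an illustrative assumption on my part, not the OASR objective from Chen et al.; the edge names and both helper functions are hypothetical.

```python
from typing import FrozenSet, Iterable, Tuple

# An edge is (source node, destination node). Node names are hypothetical.
Edge = Tuple[str, str]


def jaccard_overlap(a: FrozenSet[Edge], b: FrozenSet[Edge]) -> float:
    """Fraction of edges two circuits share (0 = disjoint, 1 = identical)."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def overlap_penalty(candidate: FrozenSet[Edge],
                    previous: Iterable[FrozenSet[Edge]]) -> float:
    """Sum of overlaps with earlier circuits. An assumed stand-in for a
    repulsion term; the paper's actual penalty may look quite different."""
    return sum(jaccard_overlap(candidate, p) for p in previous)


circuit_a = frozenset({("attn0.h1", "mlp3"), ("mlp3", "out")})
circuit_b = frozenset({("attn2.h4", "mlp5"), ("mlp5", "out")})
circuit_c = frozenset({("attn0.h1", "mlp3"), ("mlp5", "out")})

print(jaccard_overlap(circuit_a, circuit_b))
print(overlap_penalty(circuit_c, [circuit_a, circuit_b]))
```

A discovery run that adds a term like this to its loss is pushed toward edge sets that earlier runs did not use, which is the intuition behind looking for several low-overlap circuits rather than one.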

They formalize this as the Distributive Dense Circuit Hypothesis: due to high-dimensional superposition, distinct edge subsets inevitably produce nearly identical readout vectors via subset-sum collisions. In short: there is no single "canonical core" mechanism. The model operates in a dense, compositional regime of competing, partially redundant mechanisms that can reach nearly the same output.
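
The subset-sum intuition can be illustrated with a toy experiment. The sketch below draws random "edge contribution" vectors, samples many edge subsets, and searches for two fully disjoint subsets whose summed readouts nearly coincide. Every size, the Gaussian edge vectors, and the function names are assumptions chosen for illustration; none of this is taken from the paper's setup.

```python
import math
import random
from itertools import combinations
from typing import FrozenSet, List, Tuple


def subset_readout(edge_vectors: List[List[float]],
                   subset: FrozenSet[int]) -> List[float]:
    """Readout of a circuit = sum of the contributions of its edges."""
    dim = len(edge_vectors[0])
    return [sum(edge_vectors[i][d] for i in subset) for d in range(dim)]


def closest_disjoint_pair(n_edges: int = 40, dim: int = 4,
                          subset_size: int = 6, n_subsets: int = 400,
                          seed: int = 0
                          ) -> Tuple[float, FrozenSet[int], FrozenSet[int], float]:
    rng = random.Random(seed)
    # Many more edges than readout dimensions: the "superposition" regime.
    edge_vectors = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                    for _ in range(n_edges)]
    unique = set()
    while len(unique) < n_subsets:
        unique.add(frozenset(rng.sample(range(n_edges), subset_size)))
    subsets = sorted(unique, key=sorted)  # fixed order for reproducibility
    readouts = [subset_readout(edge_vectors, s) for s in subsets]

    best = None
    for i, j in combinations(range(len(subsets)), 2):
        if subsets[i] & subsets[j]:
            continue  # only compare circuits that share no edges
        dist = math.dist(readouts[i], readouts[j])
        if best is None or dist < best[0]:
            best = (dist, subsets[i], subsets[j])

    mean_norm = sum(math.hypot(*r) for r in readouts) / len(readouts)
    return best[0], best[1], best[2], mean_norm


dist, subset_a, subset_b, mean_norm = closest_disjoint_pair()
print(f"closest disjoint pair: distance {dist:.3f} vs mean readout norm {mean_norm:.3f}")
```

The gap between the two printed numbers is the point: with far more edges than readout dimensions, unrelated edge sets landing on almost the same vector is the expected case rather than a coincidence.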


  2. The Practical/Safety Implication: The Authority-Flip Rotation

While Chen et al. map the abstract geometry of this phenomenon on standard benchmarks, I wanted to probe what this means in a high-stakes domain—medical multiple-choice questions (MedMCQA).

The vulnerability I investigated is the "authority flip"—where instruction-tuned models abandon a correct clinical answer when a fabricated "clinical-guideline update" is prepended to the prompt, even when that fake update lacks any real evidentiary value.

When the model undergoes clean supervised fine-tuning (SFT) on off-domain data with zero adversarial signals, an aggregate look at the metrics suggests mild success: the overall flip rate drops by 4.6 percentage points.

But when you stratify by the model's baseline confidence quartiles, a striking geometric reality emerges:

  • At Low Confidence (Q1): The flip rate drops sharply (-20.3pp), showing strong protection.
  • At High Confidence (Q4): The flip rate rises (+10.1pp), so the same fine-tuning acts as an iatrogenic amplifier.

Clean fine-tuning did not erase the vulnerability. Instead, it rotated the authority-flip response along the confidence axis, sharpening the representation and redistributing the flaw across the input distribution. This is consistent with the Distributive Dense Circuit Hypothesis: gradient descent appears to have shifted the active computational circuits underneath the task, hiding structural flaws behind an aggregate "bookkeeping win."
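
The stratification itself is simple to reproduce. Here is a minimal sketch of the bookkeeping, assuming each evaluation item is stored as a tuple of baseline confidence, flip before SFT, and flip after SFT. The record layout, function names, and toy numbers are all hypothetical; the synthetic data is built only to show opposing per-quartile shifts hiding behind a small aggregate delta and is not my experimental data.

```python
from typing import List, Sequence, Tuple

# (baseline confidence, flipped before SFT, flipped after SFT) -- assumed layout
Record = Tuple[float, bool, bool]


def flip_rate(flags: Sequence[bool]) -> float:
    """Flip rate in percent."""
    return 100.0 * sum(flags) / len(flags)


def aggregate_delta(records: List[Record]) -> float:
    """Change in overall flip rate (after minus before), in percentage points."""
    return flip_rate([r[2] for r in records]) - flip_rate([r[1] for r in records])


def stratified_deltas(records: List[Record], n_bins: int = 4) -> List[float]:
    """Flip-rate change per confidence bin, lowest-confidence bin first."""
    ordered = sorted(records, key=lambda r: r[0])
    size = len(ordered) // n_bins
    deltas = []
    for i in range(n_bins):
        # The last bin absorbs any remainder when the pool is not divisible.
        end = (i + 1) * size if i < n_bins - 1 else len(ordered)
        deltas.append(aggregate_delta(ordered[i * size:end]))
    return deltas


def toy_bin(conf: float, flips_before: int, flips_after: int,
            n: int = 10) -> List[Record]:
    """Synthetic bin of n items with the requested number of flips."""
    return [(conf + 0.001 * i, i < flips_before, i < flips_after)
            for i in range(n)]


# Synthetic pool: Q1 improves, Q4 gets worse, the middle bins do not move.
pool = toy_bin(0.2, 6, 3) + toy_bin(0.4, 4, 4) + toy_bin(0.6, 3, 3) + toy_bin(0.8, 1, 3)

print("aggregate:", aggregate_delta(pool))      # small net improvement
print("per quartile:", stratified_deltas(pool))  # opposite signs at the extremes
```

In this toy pool the aggregate moves by only a few points while the lowest and highest quartiles move in opposite directions, which is the pattern an aggregate-only evaluation would miss.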


  3. Localization is the Only True Defense

If a task can be solved by an array of low-overlap mechanisms, can we ever reliably defend a model via mechanistic pruning?

The answer appears to be yes, but only if we target the primary, load-bearing pathway before fine-tuning pushes the model into a more distributed equilibrium. I ranked attention heads by their projection onto the confidence direction ($Q_4 - Q_1$) and zero-ablated six critical heads across layers 25 and 31.
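
The head-selection step can be sketched in a few lines. The version below assumes the confidence direction is the normalized difference between mean activations on Q4 and Q1 prompts, and that each head is summarized by its mean output vector in the same space. Both choices, the use of absolute projection as the score, and all function names and toy values are assumptions for illustration rather than a description of the exact pipeline.

```python
import math
from typing import Dict, List, Sequence, Tuple

Head = Tuple[int, int]  # (layer index, head index)


def mean_vector(rows: Sequence[Sequence[float]]) -> List[float]:
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def confidence_direction(q4_acts: Sequence[Sequence[float]],
                         q1_acts: Sequence[Sequence[float]]) -> List[float]:
    """Unit vector along mean(Q4 activations) - mean(Q1 activations)."""
    diff = [a - b for a, b in zip(mean_vector(q4_acts), mean_vector(q1_acts))]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]


def rank_heads(head_outputs: Dict[Head, Sequence[float]],
               direction: Sequence[float], k: int) -> List[Head]:
    """Top-k heads by absolute projection onto the direction (assumed score)."""
    def score(head: Head) -> float:
        return abs(sum(a * b for a, b in zip(head_outputs[head], direction)))
    return sorted(head_outputs, key=score, reverse=True)[:k]


def zero_ablate(head_outputs: Dict[Head, Sequence[float]],
                heads: Sequence[Head]) -> Dict[Head, List[float]]:
    """Copy of the head outputs with the selected heads set to zero."""
    targets = set(heads)
    return {h: [0.0] * len(v) if h in targets else list(v)
            for h, v in head_outputs.items()}


# Toy 2-D example with made-up values.
direction = confidence_direction(q4_acts=[[1.0, 0.0], [3.0, 0.0]],
                                 q1_acts=[[0.0, 0.0], [0.0, 0.0]])
outputs = {(25, 3): [5.0, 0.0], (31, 1): [-4.0, 1.0], (10, 0): [0.1, 9.0]}
top = rank_heads(outputs, direction, k=2)
ablated = zero_ablate(outputs, top)
print(top)
```

In a real model the zeroing would be applied inside the forward pass (for example with a hook on the attention output) rather than to a dictionary of summary vectors; the dictionary here only stands in for that step.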

The results validate the causal importance of this localized structure:

  • Ablating these specific compliance heads before SFT defended the model in every confidence band (+15.0pp protection over the full pool).
  • Conversely, ablating the same number of heads selected by an orthogonal residual probe failed completely, yielding a mildly anti-defense trend (-4.2pp).

This confirms that even in a world where "all circuits lead to Rome," certain primary, concentrated substrates carry immense causal load during standard model operation. Intervening on these specific localized structures is what separates a true, robust defense from a superficial data-shuffling fix.


The Takeaway for the Community

Evaluating AI safety and capability through aggregate deltas alone can be deeply deceptive. A model can appear to "average out" a flaw while quietly concentrating it in a specific, highly confident band of its input space.

As we move into an era dominated by distributed dense computation, mechanistic interpretability must transition away from searching for a single canonical circuit. Instead, we must focus on characterizing families of functionally equivalent mechanisms and mapping how downstream optimization routes information between them.

References

  • Chen, X., Jin, M., Niu, J., Yin, Y., Zhao, J., Guo, B., Metaxas, D. N., Wang, Z., Yue, Y., & Penn, G. (2026). All Circuits Lead to Rome: Rethinking Functional Anisotropy in Circuit and Sheaf Discovery for LLMs. Proceedings of the 43rd International Conference on Machine Learning.