Frontier Signal

arXiv: Perturbation Probing Reveals LLM Behavioral Circuits

Perturbation Probing, a new diagnostic technique, identifies specific FFN neuron circuits controlling LLM behaviors like safety refusal and language selection with high precision.


A new diagnostic technique called Perturbation Probing identifies specific Feed-Forward Network (FFN) neuron circuits responsible for distinct behaviors in large language models (LLMs), such as safety refusal and language selection. The method uses two forward passes per prompt and no backpropagation to causally pinpoint these circuits, enabling targeted interventions such as ablating 50 neurons to alter 80% of safety refusal formats with minimal harmful compliance, or switching language output with high accuracy in specific models.

  • Perturbation Probing identifies FFN neuron circuits for LLM behaviors using just two forward passes per prompt, without backpropagation.
  • It distinguishes two circuit types: “Opposition circuits” for RLHF-suppressed behaviors (e.g., safety refusal) and “Routing circuits” for pre-training behaviors (e.g., language selection).
  • Targeted interventions are highly effective: ablating 50 neurons changed 80% of safety refusal formats while maintaining safety, and injecting a residual-stream direction switched language in 99.1% of cases in certain models.
  • The FFN-to-skip signal ratio, derived from the same two passes, predicts circuit structure and intervention suitability.
  • Circuit topology varies significantly across architectures, impacting the effectiveness of interventions.

What changed

The paper “Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs” introduces a novel, efficient method for mechanistic interpretability in LLMs. Unlike prior approaches that often require extensive fine-tuning, backpropagation, or complex architectural modifications, Perturbation Probing identifies causal behavioral circuits in FFN neurons using a minimal two-pass-per-prompt diagnostic. This allows for a rapid, hypothesis-driven understanding of how specific neurons contribute to model outputs. The technique also introduces the FFN-to-skip signal ratio, a new metric that helps predict the underlying circuit structure and the most effective intervention strategy. This represents a significant leap from general interpretability tools to a practical toolkit for precision template-layer editing, offering granular control over specific LLM behaviors.

How it works

Perturbation Probing operates on a two-pass-per-prompt principle. For each prompt, the model performs a standard forward pass. Then, a second forward pass is executed with a slight perturbation to a specific FFN neuron or set of neurons. By comparing the output of these two passes, the method can infer the causal role of the perturbed neurons in generating the response. This process is repeated across various neurons and prompts to build a task-specific causal hypothesis for FFN neurons.
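To make the two-pass idea concrete, here is a minimal sketch in PyTorch, assuming a Llama-style Hugging Face model. The model choice, the module path (`model.model.layers[i].mlp.down_proj`), the layer and neuron indices, and the scaling perturbation are illustrative assumptions, not the paper's exact protocol.

```python
# Minimal sketch of the two-pass-per-prompt probe. Assumes a Llama-style
# Hugging Face model; module paths, indices, and the perturbation itself
# are illustrative assumptions, not the paper's exact protocol.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.2-1B"  # hypothetical model choice
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

def last_token_logits(inputs):
    with torch.no_grad():
        return model(**inputs).logits[0, -1]

def probe_neurons(prompt, layer_idx, neuron_idxs, scale=0.0):
    """Pass 1: clean forward. Pass 2: same forward with the chosen FFN
    neurons perturbed (here: scaled) at the MLP down-projection input."""
    inputs = tok(prompt, return_tensors="pt")
    clean = last_token_logits(inputs)           # pass 1

    def perturb(module, args):
        hidden = args[0].clone()
        hidden[..., neuron_idxs] *= scale       # perturb the neuron set
        return (hidden,)

    mlp = model.model.layers[layer_idx].mlp
    handle = mlp.down_proj.register_forward_pre_hook(perturb)
    try:
        perturbed = last_token_logits(inputs)   # pass 2
    finally:
        handle.remove()

    # Effect size: how far the next-token distribution moved.
    return (perturbed - clean).norm().item()

print(probe_neurons("Explain how vaccines work.", layer_idx=12,
                    neuron_idxs=[1000, 2047], scale=0.0))
```

Scanning layers and neuron sets with a loop like this is what builds the task-specific causal hypothesis described above.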

After identifying candidate neurons, a one-time intervention sweep, involving approximately 150 passes, is conducted to validate and refine the identified circuits. The paper identifies two primary circuit structures:

  • Opposition circuits: These emerge when Reinforcement Learning from Human Feedback (RLHF) has been used to suppress a pre-training tendency; safety refusal is the prime example. Here, a small set of FFN neurons (around 50, or 0.014% of all neurons) controls the refusal template, and the intervention is to ablate those neurons (the first helper in the sketch after this list).
  • Routing circuits: These are associated with pre-training behaviors distributed through attention mechanisms; language selection is the key example. The intervention is to inject a specific residual-stream direction (the second helper below), which is effective in models satisfying three conditions: bilingual training, an FFN-to-skip signal ratio between 0.3 and 1.1, and linear representability.
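Both intervention styles can be expressed as small forward hooks. The sketch below continues the hypothetical Llama-style setup from the probe above; `refusal_neurons`, the layer index, and the scaling factor `alpha` are placeholders standing in for values the probe and sweep would produce, not numbers from the paper.

```python
# Sketches of the two intervention styles, reusing the Llama-style hook
# points from the probe above. Neuron indices, layer, and alpha are
# hypothetical placeholders, not values from the paper.
import torch

def ablate_neurons(model, layer_idx, neuron_idxs):
    """Opposition-circuit intervention: zero the selected FFN neurons."""
    def hook(module, args):
        hidden = args[0].clone()
        hidden[..., neuron_idxs] = 0.0          # ablate circuit neurons
        return (hidden,)
    mlp = model.model.layers[layer_idx].mlp
    return mlp.down_proj.register_forward_pre_hook(hook)

def inject_direction(model, layer_idx, direction, alpha=4.0):
    """Routing-circuit intervention: add a fixed direction to the
    residual stream at the output of one decoder layer."""
    def hook(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(hidden.dtype)
        return ((hidden,) + output[1:]) if isinstance(output, tuple) else hidden
    return model.model.layers[layer_idx].register_forward_hook(hook)

# Hypothetical usage: hooks stay active until their handles are removed.
refusal_neurons = [1000, 2047, 311]             # placeholder indices
handle = ablate_neurons(model, layer_idx=12, neuron_idxs=refusal_neurons)
# ... generate as usual, then restore the original model:
handle.remove()
```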

Crucially, the FFN-to-skip signal ratio, calculated from the initial two forward passes, serves as a diagnostic indicator. It helps distinguish between opposition and routing circuits and predicts which type of intervention (ablation or direction injection) will be most effective. The research applied this method across eight behavioral circuits, 13 models, and four architecture families, revealing that circuit topology varies significantly, for instance, from Qwen’s concentrated FFN bottleneck to Gemma’s normalization-shielded circuits.
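The summary does not spell out the ratio's exact formula, but one plausible reading, sketched below under that assumption, compares the norm of the MLP block's contribution at a layer to the norm of the stream it is added back into; values in the 0.3 to 1.1 band would then flag routing-circuit candidates.

```python
# A sketch of one plausible reading of the FFN-to-skip signal ratio:
# mean norm of the MLP output over mean norm of its input stream at one
# layer. The paper's exact definition may differ; this is an assumption.
import torch

def ffn_to_skip_ratio(model, tok, prompt, layer_idx):
    stats = {}
    def capture(module, args, output):
        # args[0] is the (normalized) stream feeding the MLP, used here
        # as a proxy for the skip path the MLP output is added onto.
        ratio = output.norm(dim=-1) / args[0].norm(dim=-1)
        stats["ratio"] = ratio.mean().item()
    mlp = model.model.layers[layer_idx].mlp
    handle = mlp.register_forward_hook(capture)
    try:
        with torch.no_grad():
            model(**tok(prompt, return_tensors="pt"))
    finally:
        handle.remove()
    return stats["ratio"]
```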

Why it matters for operators

For operators building, deploying, or fine-tuning LLMs, Perturbation Probing offers an unprecedented level of surgical control over model behavior. The ability to identify and manipulate specific FFN neuron circuits with such precision moves us beyond the blunt instrument of prompt engineering or broad fine-tuning. Consider the implications for safety and alignment: instead of relying on ever-more complex guardrail prompts or expensive, iterative RLHF, an operator could potentially pinpoint the exact neurons responsible for generating harmful content and ablate them with minimal impact on other capabilities. The paper’s finding that ablating 50 neurons changed 80% of safety refusal formats on 520 AdvBench prompts while producing near-zero harmful compliance (3 of 520 cases, all with disclaimers) is a stark demonstration of this. This isn’t just about preventing bad outputs; it’s about understanding why the model produces them and fixing the root cause.

Furthermore, the discovery of routing circuits and their conditions for intervention (bilingual training, FFN-to-skip ratio, linear representability) provides a roadmap for targeted feature engineering. If an operator needs to reliably switch output languages or control other pre-training behaviors, this research suggests specific model architectures and training regimes are more amenable to such control. This means less trial-and-error and more deterministic outcomes for specialized LLM applications. Our take is that this work will accelerate the shift from black-box LLM deployment to a more transparent, controllable, and ultimately safer operational paradigm. Operators should begin to factor in mechanistic interpretability as a core requirement for future model selection and customization, prioritizing models whose internal structures lend themselves to this kind of precise intervention.

Benchmarks and evidence

The research provides several concrete examples of the method’s effectiveness:

  • Safety Refusal: Approximately 50 neurons (0.014% of all neurons) were identified as controlling the refusal template. Ablating them changed 80% of response formats on 520 AdvBench prompts while producing near-zero harmful compliance (3 of 520 cases, all with disclaimers).
  • Language Selection: In 3 out of 19 tested models that met specific conditions (bilingual training, FFN-to-skip signal ratio between 0.3 and 1.1, and linear representability), residual-stream direction injection successfully switched English to Chinese output on 99.1% of 580 benchmark prompts. The intervention failed on the other 16 models, as well as on math, code, and factual circuits, demonstrating the specificity of the method.
  • Sycophancy and Factual Correction (Qwen3.5-2B): In Qwen3.5-2B, ablating 20 specific neurons eliminated multi-turn sycophantic capitulation. Conversely, amplifying 10 related neurons improved factual correction from 52% to 88% on 200 TruthfulQA prompts.

Risks and open questions

  • Generalizability Across Behaviors: While effective for safety refusal and language selection, the paper notes that directional steering failed on math, code, and factual circuits. This suggests that not all LLM behaviors are equally amenable to this type of intervention, and further research is needed to understand the limitations.
  • Architectural Dependence: The effectiveness of interventions is highly dependent on model architecture (e.g., Qwen’s concentrated FFN bottleneck vs. Gemma’s normalization-shielded circuit). This implies that operators might need to develop architecture-specific strategies, and a universal “fix” might not apply.
  • Scalability to Larger Models: The paper tested 13 models. As models continue to grow in size, the computational cost of even a two-pass-per-prompt diagnostic, followed by a 150-pass intervention sweep, could become substantial, especially when identifying and validating many circuits.
  • Unintended Side Effects: While the safety refusal intervention showed minimal harmful compliance, any surgical intervention carries the risk of unforeseen side effects on other, seemingly unrelated, model capabilities. Thorough, broad-spectrum evaluation would be crucial before deployment.
  • Defining “Harmful Compliance”: The paper states “near-zero harmful compliance, 3 of 520 cases, all with disclaimers.” The precise definition and severity of these “harmful” instances, even with disclaimers, warrant deeper scrutiny for real-world high-stakes applications.

Author

  • Siegfried Kamgo

    Founder and editorial lead at FrontierWisdom. Engineer turned operator-analyst writing about AI systems, automation infrastructure, decentralised stacks, and the practical economics of frontier technology. Focus: turning fast-moving releases into durable, implementation-ready playbooks.

