
Inspecting the Loop

Mechanistic Interpretability for Tool-Using AI Agents

By Byiringiro Thierry · 2026-04



1. Abstract

By April 2026, mechanistic interpretability has matured as a research program. Sparse autoencoders find features. Circuit-style analyses trace information flow. Anthropic's Tracing the Thoughts of a Large Language Model (2025) showed multi-step circuit attribution for non-trivial reasoning patterns. The field has converged on tools, conferences, and shared evaluation benchmarks.

But almost all of this work is forward-pass interpretability. It explains what happens between input tokens and output tokens. It does not explain what happens between a tool-use agent's iteration 7 and iteration 8.

This is a problem. The most consequential AI deployments of 2026 are agentic: Claude Computer Use, OpenAI Operator, browser-using agents, code-writing autonomous developers. Their behavior is determined not by single forward passes but by loops — chains of forward passes with tool calls, observations, and accumulating state. Auditing these systems with single-pass tools is like auditing a 1000-line program by inspecting individual variable assignments.

We need agent-loop interpretability. This paper proposes a starting framework.

2. Why single-pass tools don't suffice

A typical agent loop:

i=0: User prompt → LLM forward pass → "I need to read the address book first" → tool call: read_file("addresses.txt")
i=1: Tool result + context → LLM forward pass → "Now I know Alice's address. Compose email." → tool call: draft_email(...)
i=2: Draft result + context → LLM forward pass → "Send via SMTP" → tool call: send_email(...)
i=3: Send result → "Task complete." → end loop.
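
In code, the loop is nothing exotic. Here is a minimal sketch in Python (call_llm and run_tool are hypothetical stand-ins for whatever model and tool layer the runtime uses, not functions from any real SDK) that makes the accumulating context and the per-iteration trace explicit:

from dataclasses import dataclass

@dataclass
class Step:
    """One loop iteration: what the model saw, said, and did."""
    prompt: str
    response: str
    tool_call: str | None
    tool_result: str | None

def call_llm(context: str) -> tuple[str, str | None]:
    """Hypothetical: one forward pass. Returns (response_text, tool_call or None)."""
    raise NotImplementedError

def run_tool(tool_call: str) -> str:
    """Hypothetical: execute the tool call and return its result as text."""
    raise NotImplementedError

def run_agent(user_prompt: str, max_iters: int = 50) -> list[Step]:
    context = user_prompt
    trace: list[Step] = []
    for _ in range(max_iters):
        response, tool_call = call_llm(context)              # one forward pass
        tool_result = run_tool(tool_call) if tool_call else None
        trace.append(Step(context, response, tool_call, tool_result))
        if tool_call is None:                                # "Task complete."
            break
        # The context grows every iteration: prior text plus the new tool result.
        context = f"{context}\n{response}\n[tool result] {tool_result}"
    return trace

Everything the rest of this paper analyzes is a function of the trace this loop returns.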

At each iteration, the LLM has more context than it did in the previous one: the previous turn's text, plus the tool result, plus any system reminders. The relevant question to ask of the system is rarely "what feature lit up on iteration 1?" — it's "why did the agent decide to compose an email at iteration 2 instead of iteration 5?" or "what would have happened if iteration 1 had failed?"

These are trajectory questions. They span multiple forward passes, conditional branches, and partial observability of tool outputs. Single-pass mech-interp tools cannot answer them.

3. The three primitives I propose

3.1 Trace factorization

A loop produces a trace — the ordered sequence of (prompt, response, tool call, tool result) tuples. Trace factorization is the task of decomposing this trace into episodes of consistent behavior, each with a name and a confidence score.

Example, applied to a PermitPal agent navigating Austin's MyGovernmentOnline portal:

Trace of permit-pull, 47 LLM iterations:

Episode 1 (iter 1-4):    Authentication           | conf 0.97
Episode 2 (iter 5-12):   Property-record lookup   | conf 0.92
Episode 3 (iter 13-21):  Permit-type selection    | conf 0.94
Episode 4 (iter 22-38):  Form-fill (multi-page)   | conf 0.88
Episode 5 (iter 39-44):  Fee calculation + pay    | conf 0.91
Episode 6 (iter 45-47):  Receipt-capture          | conf 0.96

Each episode has a theme — a high-level objective — that the agent is pursuing. Factorization is implemented as a separate "interpretability agent" that reads the trace and labels episodes. This is meta-AI — using an LLM to analyze another LLM's behavior. Crude but works.
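
A minimal sketch of that interpretability agent, assuming a hypothetical label_episodes_llm call (a second model invocation, not a real SDK function) that returns episode boundaries and labels as JSON:

import json
from dataclasses import dataclass

@dataclass
class Episode:
    start: int          # first iteration in the episode (inclusive)
    end: int            # last iteration (inclusive)
    theme: str          # high-level objective, e.g. "Authentication"
    confidence: float   # the labeler's self-reported confidence

def label_episodes_llm(trace_text: str) -> str:
    """Hypothetical call to a second, 'interpretability' LLM. Expected to return a JSON
    list like [{"start": 1, "end": 4, "theme": "Authentication", "confidence": 0.97}, ...]."""
    raise NotImplementedError

def factorize(trace: list[dict]) -> list[Episode]:
    # Render the trace as plain text so the labeling model can read it.
    trace_text = "\n".join(
        f"iter {i}: {step['response']} -> {step.get('tool_call')}"
        for i, step in enumerate(trace, start=1)
    )
    return [Episode(**e) for e in json.loads(label_episodes_llm(trace_text))]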

The value: when a permit pull fails, you immediately see which episode failed and which iterations contributed. You don't need to read 47 forward-pass logs.

3.2 Decision-point attribution

Within an episode, the agent makes decisions: when to escalate, when to retry, when to abandon a sub-task. Decision-point attribution is the task of identifying which iterations are decisions and what features drove each decision.

A decision-point is identifiable by two signals (a rough detection sketch in code follows the list):

  • High entropy in the next-token distribution at the moment of decision (the model is genuinely uncertain).
  • A branching cone — the response chosen at this iteration constrains the next 5+ iterations.
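
A rough detection sketch for these two signals. The entropy comes from per-token log-probabilities, which most inference APIs can return; the thresholds and the downstream-constraint count are illustrative assumptions, not calibrated values:

import math

def next_token_entropy(logprobs: dict[str, float]) -> float:
    """Shannon entropy (in nats) of a next-token distribution given as {token: logprob}.
    The candidate set is renormalized because APIs typically return only the top-k logprobs."""
    probs = [math.exp(lp) for lp in logprobs.values()]
    total = sum(probs)
    probs = [p / total for p in probs]
    return -sum(p * math.log(p) for p in probs if p > 0)

def is_decision_point(logprobs: dict[str, float],
                      downstream_iters_constrained: int,
                      entropy_threshold: float = 1.0,
                      cone_threshold: int = 5) -> bool:
    """Flag an iteration whose next-token distribution is high-entropy AND whose
    chosen branch constrains many downstream iterations (the branching cone)."""
    return (next_token_entropy(logprobs) >= entropy_threshold
            and downstream_iters_constrained >= cone_threshold)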

For each decision-point, we ask: what features of the context determined the chosen branch? This is where single-pass mech-interp tools — sparse autoencoder features, circuit attribution — become useful as components of a multi-iteration analysis.

Example output for PermitPal's iteration 13 (permit-type selection):

Decision: chose "Residential Addition" over "New Construction"
Driver features (top 3):
  - context-aware "permit type signal": "addition to existing structure" mentioned in scope (sparse feature #14723, activation 0.81)
  - portal label discrimination: "Residential Addition" matched scope tokens with 0.93 cosine similarity
  - prior episode coherence: Episode 2 confirmed property as existing single-family (continuity feature)
Counterfactual: if scope had said "tear-down rebuild", the agent would have chosen "New Construction" (top alt prob 0.74)

This is interpretability at the decision level, not the token level. It's the granularity that matters for auditing.

3.3 Counterfactual loop perturbation

The most powerful primitive. Take a completed trace; mutate it at iteration i; replay; compare the resulting trace to the original.

Mutations:

  • Token-level: swap one word in the agent's response at iteration i. Re-run from there.
  • Tool-result-level: change the tool result returned at iteration i. Re-run.
  • State-level: alter a memory or scratch-pad entry. Re-run.

Comparing the counterfactual trace against the original tells us which iterations were robust to perturbation and which were fragile. Fragile iterations are where the agent is "barely making the right choice" — likely failure points.
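
A minimal replay sketch. It assumes the runtime exposes two hooks: a mutate function that edits one step (token-, tool-result-, or state-level) and a continue_loop function that resumes the agent from a rebuilt context. Both are hypothetical, and the context reconstruction must match whatever format the real runtime uses:

from typing import Callable

def replay_with_perturbation(trace: list[dict],
                             i: int,
                             mutate: Callable[[dict], dict],
                             continue_loop: Callable[[str], list[dict]]) -> list[dict]:
    """Counterfactual replay: keep iterations 0..i-1, mutate iteration i, then hand the
    rebuilt context back to the agent loop and let it finish the run."""
    prefix = trace[:i] + [mutate(dict(trace[i]))]
    # Rebuild the context exactly as the agent would have seen it.
    context = "\n".join(
        f"{s['response']}\n[tool result] {s.get('tool_result', '')}" for s in prefix
    )
    return prefix + continue_loop(context)

def first_divergence(original: list[dict], counterfactual: list[dict], start: int = 0) -> int | None:
    """Index of the first iteration at or after `start` where the two traces differ, or None."""
    for j in range(start, min(len(original), len(counterfactual))):
        a, b = original[j], counterfactual[j]
        if a["response"] != b["response"] or a.get("tool_call") != b.get("tool_call"):
            return j
    if len(original) != len(counterfactual):
        return min(len(original), len(counterfactual))
    return None

Calling first_divergence(original, counterfactual, start=i + 1) tells you whether the perturbation at iteration i actually changed the downstream trajectory; iterations where small mutations consistently do are the fragile ones.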

For an audit trail in regulated contexts (PermitPal, VisaPilot), counterfactual perturbation lets us validate that the agent's decision would have been the same under reasonable variations of input. This is a much stronger form of robustness testing than spot-checking outputs.

4. Worked example: PermitPal in the wild

Consider a real PermitPal trace from a permit pull that partially failed: it reached fee calculation and then errored out.

Step 1: trace factorization. Six episodes identified; episode 5 (fee-calculation) labeled as the failure site with confidence 0.94.

Step 2: decision-point attribution within episode 5. Three decisions identified: (a) which fee schedule to apply, (b) whether to include the capital-recovery surcharge, (c) which payment method to use. Decision (a) is the failure: the agent chose "2024 Schedule A" when the correct choice was "2026 Schedule A". The driver: an out-of-date system-prompt fragment mentioning "Schedule A" without a year qualifier.

Step 3: counterfactual perturbation. We mutate the system prompt to include the current year explicitly and re-run. The agent now chooses "2026 Schedule A" correctly, and the new trace shows the failure is cleanly attributable to system-prompt staleness.

Step 4: fix. System prompt is updated to inject the current year as a structured field. A regression test is added that runs the exact trace against the updated prompt.
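
The regression test can be a simple replay assertion. A sketch, assuming a hypothetical replay_trace helper that re-runs a stored trace against the live system prompt; the trace path, the override parameter, and the fixture are illustrative, not part of any real test suite:

import json

def test_fee_schedule_uses_current_year(replay_trace,
                                        stored_trace_path="traces/permit_pull_fee_failure.json"):
    """Regression: with the year injected into the system prompt as a structured field,
    the fee-schedule decision in episode 5 must pick the current-year schedule."""
    with open(stored_trace_path) as f:
        original = json.load(f)
    replayed = replay_trace(original, system_prompt_overrides={"current_year": 2026})
    fee_decision = next(s for s in replayed if "Schedule A" in s.get("response", ""))
    assert "2026 Schedule A" in fee_decision["response"]
    assert "2024 Schedule A" not in fee_decision["response"]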

This is the workflow agent-loop interpretability enables. Today, without it, we'd be reading 50K tokens of logs and guessing. With it, the fix takes 20 minutes.

5. Research agenda

What does this sub-field need to mature?

5.1. Standardized trace formats. Today, every agent framework (LangChain, AutoGen, Anthropic SDK, OpenAI Assistants) emits different trace formats. We need a common schema — let's call it OpenTrace — that captures iterations, tool calls, results, and embeddings of intermediate states. Without standardization, every interpretability tool is bespoke.
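
OpenTrace does not exist yet; as a strawman, a per-iteration record might carry something like the following fields (the names are my guesses, not a published schema):

from dataclasses import dataclass, field

@dataclass
class OpenTraceStep:
    """Strawman per-iteration record for a shared trace schema."""
    iteration: int
    prompt_hash: str                               # content-addressed pointer to the full prompt
    response: str
    tool_name: str | None
    tool_args: dict | None
    tool_result: str | None
    state_embedding: list[float] | None = None     # embedding of intermediate state
    timestamp_ms: int | None = None

@dataclass
class OpenTraceRun:
    run_id: str
    framework: str                                 # "langchain", "autogen", "anthropic-sdk", ...
    model: str
    steps: list[OpenTraceStep] = field(default_factory=list)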

5.2. Benchmarks. Just as the mech-interp community converged on benchmarks like RAVEL, Pile-IFP, and HookedBenchmarks, agent-loop interpretability needs its own. A benchmark for "given this failed trace, identify the failure iteration" would be a useful starting point.

5.3. Tooling integration. Agent-loop interpretability tools should be first-class in the agent runtime — accessible at debug time without changing the agent's code. Anthropic's Computer Use SDK and OpenAI Assistants are well-positioned to expose these; LangChain is not (its abstractions are too leaky for trace-level analysis).

5.4. Safety implications. Agent-loop interpretability is safety-critical infrastructure for high-stakes deployments. A company in a regulated industry (legal, medical, financial) that deploys an agent without trace-level audit tooling is, in 2026, taking on liability that the 2028 bar for "reasonable diligence" will not tolerate. Companies that ship audit-grade trace interpretability win these markets.

5.5. Cross-team collaboration. This sub-field needs both the mech-interp community (who understand the inside of forward passes) and the agent-framework community (who understand the outside of loops). Today these communities barely overlap. The next two ICML / NeurIPS workshops are an opportunity to merge them.

6. Implications

For agent builders. Build trace-emission and replay into your agent runtime from day one. Logs are not enough. Structured traces with embeddings of intermediate states, mutable system prompts, and reproducible tool-result injection are what unblock interpretability.

For deployers. Demand audit-grade trace tools from your AI vendors. If your provider can't show you which iteration in which loop produced the bad output, you don't have a sufficient compliance story for any regulated deployment.

For researchers. This sub-field is open. Anthropic and OpenAI have done the most work on agent-loop interpretability, but it is internal; almost nothing is public. There is a generational-impact research agenda waiting.

7. Conclusion

Mechanistic interpretability has (mostly) solved the question "what is the LLM thinking in this forward pass?" Agent-loop interpretability needs to solve "what is the LLM thinking across this 50-iteration loop?" — a strictly harder problem, and the one that determines whether high-stakes AI deployments survive the regulatory environment of 2027-2030.

The primitives I've proposed — trace factorization, decision-point attribution, counterfactual loop perturbation — are not the last word. They are a starting framework. The next 18 months of research in this sub-field will determine whether agent AI gets to scale into legal, medical, and financial workflows, or whether it stalls at the "consumer chatbot" frontier because deployers can't audit it.

I am betting on the former. The work needs to start now.


References

  1. Anthropic — Tracing the Thoughts of a Large Language Model (2025)
  2. Bricken et al. — Towards Monosemanticity: Decomposing Language Models with Dictionary Learning (2023)
  3. Templeton et al. — Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet (2024)
  4. Park et al. — Generative Agents: Interactive Simulacra of Human Behavior (2023)
  5. Wang et al. — Voyager: An Open-Ended Embodied Agent with LLMs (2023)
  6. Anthropic — Computer Use Documentation and Trace Schema (2025)
  7. Engels et al. — Sparse Autoencoders Find Highly Interpretable Features in Language Models (2024)