Researchers are increasingly focused on verifying the correctness of code generated by large language models, a challenge currently met with resource-intensive external tests or fallible auxiliary models. Yicheng He, Zheng Zhao (University of Edinburgh), Zhou Kaiyu (Nanyang Technological University), and colleagues demonstrate a novel approach: assessing an LLM’s functional correctness by analysing its internal computational structure. Their method, CodeCircuit, treats verification as a mechanistic diagnostic task, mapping algorithmic trajectories into line-level attribution graphs to identify structural signatures of sound reasoning. The work is significant because it establishes internal introspection as a decodable property for verifying generated code, offers a potentially more reliable and efficient alternative to existing methods, and confirms robustness across Python, C++, and Java.
Current methods for code verification often rely on external tools like unit tests or auxiliary LLM judges, which can be resource-intensive and limited by their own capabilities.
This work addresses a fundamental question: can an LLM’s correctness be assessed solely from its internal neural dynamics during code generation? The study proposes treating verification as a mechanistic diagnostic task, mapping the model’s algorithmic trajectory into detailed, line-level attribution graphs.
By decomposing complex residual flows, the researchers aimed to identify structural signatures within the model’s internal circuits that distinguish sound reasoning from logical failures. Analysis across Python, C++, and Java demonstrates that these intrinsic correctness signals are consistent regardless of the programming language used.
Topological features extracted from these internal graphs predict correctness more reliably than existing heuristic methods and, crucially, allow for targeted interventions to correct erroneous logic within the model itself. These findings establish internal introspection as a decodable property for verifying generated code, offering a new paradigm for assessing LLM reliability.
CodeCircuit utilises sparse autoencoders to decompose residual flows into interpretable features, constructing a causal graph that reveals the model’s internal reasoning process. The framework conducts an extensive analysis across multiple programming languages, focusing on the detailed line-level attribution graphs to expose the underlying failure mechanisms.
Results indicate that correct and incorrect code exhibit systematic differences in their attribution graph topologies, enabling CodeCircuit to outperform both black-box and gray-box verification methods. Furthermore, targeted interventions on nodes within the graph demonstrate a causal link between the attribution features and code correctness, allowing for active mechanistic debugging.
Per-Layer Transcoder implementation for attribution graph construction
Attribution Graphs (AGs) form the basis of this research, providing a causal, linear decomposition of a transformer model’s computation for a specific input prompt. The construction of an AG relies on the local replacement model, which substitutes standard Multi-Layer Perceptrons (MLPs) with Per-Layer Transcoders (PLTs).
A PLT generates a sparse feature vector f^(l) at layer l from the residual stream x^(l) using an encoder W_enc^(l) and a non-linearity σ, expressed as f^(l) = σ(W_enc^(l) x^(l) + b_enc^(l)). These features are then trained to reconstruct the local MLP output via decoder weights W_dec^(l), giving m^(l) = W_dec^(l) f^(l) + b_dec^(l).
Unlike standard autoencoders, PLTs disentangle features within the dense activation space, enabling a more interpretable decomposition. The local replacement model corrects for any discrepancy between the true MLP output m_MLP^(l) and the PLT reconstruction, m_adj^(l) = m^(l) + (m_MLP^(l) − m^(l)), effectively introducing an error term that acts as a bias node.
This process yields a locally linearized computation where downstream quantities are linear functions of feature activations and residual stream components, with attention outputs and normalization statistics remaining fixed. Linearity is maintained with respect to the substituted MLP pathways during a frozen forward pass.
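The PLT substitution and its error-node correction can be sketched in a few lines of numpy. The dimensions, random weights, and the tanh stand-in for the frozen MLP are illustrative assumptions for this sketch, not details from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_feat = 8, 32  # toy dimensions (assumptions, not from the paper)

# Per-Layer Transcoder parameters for one layer l
W_enc = rng.normal(0, 0.1, (d_feat, d_model))
b_enc = np.zeros(d_feat)
W_dec = rng.normal(0, 0.1, (d_model, d_feat))
b_dec = np.zeros(d_model)

def plt_features(x):
    """Sparse feature vector f^(l) = sigma(W_enc x + b_enc), with sigma = ReLU."""
    return np.maximum(0.0, W_enc @ x + b_enc)

def plt_reconstruction(f):
    """Reconstructed MLP output m^(l) = W_dec f + b_dec."""
    return W_dec @ f + b_dec

x = rng.normal(size=d_model)      # residual stream at layer l
f = plt_features(x)
m_hat = plt_reconstruction(f)

# Stand-in for the true (frozen) MLP output; the replacement model adds the
# reconstruction error back as a bias-like "error node", making the
# substituted forward pass exact:
m_true = np.tanh(x)
m_adj = m_hat + (m_true - m_hat)  # error-corrected output

assert np.allclose(m_adj, m_true)  # correction makes the replacement exact
assert np.all(f >= 0)              # ReLU features are non-negative (sparse)
```

The error node is what keeps the linearised computation faithful: the PLT need not reconstruct the MLP perfectly, because the residual is carried along explicitly.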
The AG itself is a directed acyclic graph G = (V, E), comprising nodes V that include active PLT features, token embeddings, error terms, and output logits. Edges w_ij ∈ E, connecting a source node i at layer l to a target node j at layer l′ > l, represent the linear contribution of node i’s activation to the pre-activation of node j.
This contribution is quantified as w_ij = a_i · v_in,j^⊤ J_ij v_out,i, where a_i denotes the activation of node i, v_in,j and v_out,i are the residual-stream read and write directions of nodes j and i respectively, and J_ij is the Jacobian of the frozen residual-stream transformation between them. This methodology was applied across Python, C++, and Java to assess the robustness of intrinsic correctness signals.
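Under that definition, a single edge weight is a small linear-algebra computation. The toy dimensions, random direction vectors, and the identity Jacobian below are illustrative assumptions, chosen only to make the formula concrete:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8  # residual stream width (toy assumption)

a_i = 1.7                      # activation of source node i
v_out_i = rng.normal(size=d)   # direction through which i writes to the residual stream
v_in_j = rng.normal(size=d)    # direction through which j reads from the residual stream
J_ij = np.eye(d)               # Jacobian of the frozen residual transform (identity toy case)

# Linear contribution of node i to node j's pre-activation:
# w_ij = a_i * v_in,j^T J_ij v_out,i
w_ij = a_i * float(v_in_j @ J_ij @ v_out_i)
```

With the Jacobian fixed by the frozen forward pass, every edge weight reduces to such a bilinear form, which is what makes the graph a linear decomposition of the computation.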
Internal graph topology predicts code validity across multiple programming languages
Attribution graphs reveal systematic topological differences between correct and incorrect code construction steps across Python, C++, and Java. Analysis of these graphs demonstrates that internal attribution structures correlate with code correctness, irrespective of the programming language used. Specifically, the research establishes a mechanistic link between a language model’s internal computations and the logical validity of generated code.
Topological features extracted from these internal graphs consistently outperform both black-box methods, such as Temperature Scaling, and gray-box methods, like Chain-of-Embedding, in verification performance. The study utilises CodeCircuit, a scientific instrument designed for precise auditing and debugging of code generation, providing mechanistic insights beyond those achievable with coarser methods.
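To illustrate how topological features of an attribution graph might feed a verifier, the sketch below computes a few simple graph statistics. These particular statistics (density, mean out-degree, largest edge weight) and the toy graphs are assumptions for demonstration, not the paper's actual feature set:

```python
def topo_features(edges, n_nodes):
    """Summarise a graph's topology as (density, mean out-degree, max |weight|).

    edges: list of (source, target, weight) tuples over nodes 0..n_nodes-1.
    These are illustrative features, not the ones used by CodeCircuit.
    """
    out_deg = {}
    for i, _, _ in edges:
        out_deg[i] = out_deg.get(i, 0) + 1
    density = len(edges) / (n_nodes * (n_nodes - 1))
    mean_out = sum(out_deg.values()) / n_nodes
    max_w = max(abs(w) for _, _, w in edges)
    return (density, mean_out, max_w)

# Toy example: a richly connected "correct" graph vs a sparse "buggy" one
correct_edges = [(0, 1, 0.9), (0, 2, 0.4), (1, 2, 0.7), (1, 3, 0.5), (2, 3, 0.8)]
buggy_edges = [(0, 3, 0.2), (1, 3, 0.1)]

fc = topo_features(correct_edges, 4)
fb = topo_features(buggy_edges, 4)
assert fc[0] > fb[0]  # the toy "correct" graph is denser
```

In practice such feature vectors would be fed to a lightweight classifier trained on labelled generations; the point here is only that graph topology yields a fixed-size numerical signature.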
CodeCircuit decomposes complex residual flows into a causal graph of interpretable features, enabling detailed examination of failure mechanisms at the line level. By constructing attribution graphs, the work identifies structural signatures that distinguish sound reasoning from logical failure within the model’s internal circuits.
These graphs are built upon a local replacement model, substituting standard Multi-Layer Perceptrons with Per-Layer Transcoders to create sparse feature vectors. Crucially, targeted interventions on nodes within the attribution graph demonstrate the ability to causally correct erroneous code. This confirms that the graph captures the functional mechanism of generation, moving beyond passive verification towards active mechanistic debugging.
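A targeted intervention of this kind can be sketched as ablating one feature in the linearised replacement model and observing the shift in a readout. Everything below (dimensions, random decoder weights, the readout direction) is an illustrative assumption, not the paper's intervention procedure:

```python
import numpy as np

rng = np.random.default_rng(2)
d_model, d_feat = 8, 16  # toy dimensions

W_dec = rng.normal(0, 0.3, (d_model, d_feat))   # PLT decoder (toy)
f = np.maximum(0.0, rng.normal(size=d_feat))    # active PLT features at some layer
logit_dir = rng.normal(size=d_model)            # readout direction for a target token

def readout(features):
    """Project the decoded residual-stream contribution onto the readout."""
    return float(logit_dir @ (W_dec @ features))

base = readout(f)

# Pick the active feature with the largest influence on the readout and ablate it
scores = np.abs(W_dec.T @ logit_dir) * (f > 0)
k = int(np.argmax(scores))
f_patched = f.copy()
f_patched[k] = 0.0

assert abs(readout(f_patched) - base) > 0  # the intervention shifts the output
```

Because the replacement model is linear in the feature activations, the effect of clamping a node is exactly its edge contribution, which is what licenses the causal reading of the graph.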
The local replacement model linearly approximates computations, allowing for the tracing of information flow through interpretable features and the quantification of feature contributions. Edge weights within the graph represent the linear contribution of a node’s activation to the pre-activation of subsequent nodes, calculated using a specific formulation involving Jacobians and residual stream directions.
Nodes and edges are pruned based on their attribution to the final model output, resulting in a sparse and mechanistically faithful circuit. The objective is to predict the correctness label for each logical step within a code snippet, formulating code verification as a mechanistic sequence labelling problem.
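The pruning step can be sketched as ranking nodes by their attribution to the output and keeping only the top fraction, then dropping edges that touch removed nodes. The helper name `prune_graph`, the keep fraction, and the toy attribution scores are hypothetical:

```python
def prune_graph(edges, node_attr, keep_frac=0.5):
    """Keep the nodes with the highest attribution to the final output,
    then retain only edges between surviving nodes (illustrative sketch).

    edges: list of (source, target, weight); node_attr: node -> attribution score.
    """
    ranked = sorted(node_attr, key=node_attr.get, reverse=True)
    kept = set(ranked[: max(1, int(len(ranked) * keep_frac))])
    pruned_edges = [(i, j, w) for i, j, w in edges if i in kept and j in kept]
    return kept, pruned_edges

# Toy attribution scores and edges (hypothetical node names)
attr = {"feat_a": 0.9, "feat_b": 0.05, "feat_c": 0.6, "emb_x": 0.4}
edges = [("emb_x", "feat_a", 0.8), ("feat_a", "feat_c", 0.5),
         ("feat_b", "feat_c", 0.1)]

kept, sparse_edges = prune_graph(edges, attr, keep_frac=0.5)
assert kept == {"feat_a", "feat_c"}
assert sparse_edges == [("feat_a", "feat_c", 0.5)]
```

The surviving subgraph is what gets labelled: each logical step of the snippet maps to a region of this sparse circuit, and correctness prediction becomes sequence labelling over those regions.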
In summary, the study demonstrates that the functional correctness of LLM-generated code can be assessed by examining the model’s internal computational structure, rather than by relying on external tests or judging models. Treating verification as a mechanistic diagnostic task, it maps the model’s algorithmic trajectory into line-level attribution graphs whose topological features predict correctness more reliably than conventional heuristics, consistently across Python, C++, and Java. Targeted interventions on these graphs can rectify flawed logic, establishing internal introspection as a decodable property and a viable path towards assessing code correctness directly from model behaviour rather than external supervision.
The authors acknowledge that this mechanistic verification should currently complement, rather than replace, human review and established software testing procedures. While the framework, named CodeCircuit, advances the goal of transparent and interpretable machine learning for automated software engineering, it is not presented as a complete solution to ensuring code reliability. Future work could explore the application of this internal analysis to more complex software projects and agentic frameworks, potentially enhancing the early detection of logical failures and vulnerabilities that might be missed by standard testing methods.
👉 More information
🗞 CodeCircuit: Toward Inferring LLM-Generated Code Correctness via Attribution Graphs
🧠 ArXiv: https://arxiv.org/abs/2602.07080
