Researchers are increasingly interested in the self-reflection capabilities of R1-style large language models, but the underlying cognitive processes driving this behaviour have remained elusive. Yanrui Du, Yibo Gao, Sendong Zhao, Jiayun Li and Haochun Wang from Harbin Institute of Technology, together with Qika Lin from the National University of Singapore, have traced layer-wise activation trajectories to understand how these models initiate and execute reflection. Their work reveals a structured progression from latent monitoring of a ‘thinking budget’ in early layers, through discourse-level cue processing in intermediate layers, to the emergence of reflection-related tokens in later layers. This research is significant because it demonstrates a potential human-like meta-cognitive process within these artificial intelligence systems, offering valuable insight into the mechanisms enabling complex behaviours and paving the way for more interpretable and controllable language models.
This work focuses on R1-style LLMs, which exhibit an inherent capacity for self-reflection, marked by the emission of tokens such as “Wait” and “Hmm”.
By tracing layer-wise activation trajectories, scientists have identified three distinct stages in this process: latent-control layers, semantic-pivot layers, and behavior-overt layers. The study employs a novel approach using a “logit lens” to decode token-level semantics from intermediate-layer activations, providing unprecedented insight into the model’s internal state.
Initial analysis reveals that latent-control layers encode a “thinking budget”, represented by an approximate linear direction within the model’s activations, responding to prompt-level cues. Moving deeper, semantic-pivot layers demonstrate a dominance of discourse-level cues, specifically turning-point and summarization signals, which significantly influence the probability mass distribution.
Finally, in behavior-overt layers, the likelihood of sampling reflection-behavior tokens demonstrably increases, culminating in the generation of self-reflective markers. These findings suggest a hierarchical organization, where initial latent monitoring transitions into discourse-level regulation, ultimately resulting in overt self-reflection.
Further investigation employed targeted interventions to establish a causal chain connecting these stages. Manipulating prompt-level semantics demonstrably modulates activation projections along latent-control directions, inducing competition between turning-point and summarization cues within the semantic-pivot layers.
This competition, in turn, directly regulates the sampling likelihood of reflection-behavior tokens in the behavior-overt layers. Experiments involving activation steering within the latent-control layers corroborated these findings, reinforcing the existence of a coherent depth-wise causal pathway. The research, detailed with analysis code available at https://github.com/DYR1/S3-CoT, provides a mechanistic understanding of reflective behaviour, potentially enabling improved predictability, control, and refinement of these capabilities in future LLMs.
Decoding Reflective Thought via Latent Control and Discourse Marker Analysis offers new insights into cognitive processes
A logit lens serves as the primary tool for reading out token-level semantics from intermediate-layer activations within the R1-style large language model. This technique applies the output decoder to activations at each layer, enabling the characterisation of functional roles across different depths of the network.
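In outline, the logit lens applies the model's final normalisation and output unembedding to a hidden state taken at any layer, then reads off the highest-probability tokens. A minimal NumPy sketch with toy dimensions (the function names, RMSNorm choice, and sizes are illustrative, not taken from the paper's code):

```python
import numpy as np

def rms_norm(h, eps=1e-6):
    # Final-layer RMSNorm, as used in LLaMA/Qwen-family (R1-style) models
    return h / np.sqrt(np.mean(h ** 2) + eps)

def logit_lens(h_layer, W_U, top_k=3):
    """Decode an intermediate-layer activation into token probabilities.

    h_layer: (d_model,) hidden state taken at some layer
    W_U:     (d_model, vocab) output unembedding matrix
    Returns the top-k token ids and the full probability vector.
    """
    logits = rms_norm(h_layer) @ W_U
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    probs /= probs.sum()
    return np.argsort(probs)[::-1][:top_k], probs

# Toy example: a 4-dimensional model with a 5-token vocabulary
rng = np.random.default_rng(0)
W_U = rng.normal(size=(4, 5))
h = rng.normal(size=4)
top_ids, probs = logit_lens(h, W_U)
```

Running the same readout at every depth is what lets the study assign functional roles to layer blocks.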
Researchers first decoded contrastive activations derived from paired prompts designed to either encourage or discourage reflective thinking. This comparative approach isolates latent control signals, revealing a contiguous block of early-to-mid layers responsible for encoding a ‘thinking budget’, termed latent-control layers.
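One standard way to isolate such a latent direction from paired prompts is a difference of mean activations between the two prompt sets. The sketch below uses synthetic data and is an assumption about the general technique, not the paper's exact procedure:

```python
import numpy as np

def latent_direction(acts_encourage, acts_discourage):
    """Unit 'thinking budget' direction at one layer, estimated as the
    difference of mean activations between the two prompt sets.

    acts_*: (n_prompts, d_model) activations from reflection-encouraging
            and reflection-discouraging prompts.
    """
    d = acts_encourage.mean(axis=0) - acts_discourage.mean(axis=0)
    return d / np.linalg.norm(d)

def budget_projection(h, direction):
    # Scalar projection of a single hidden state onto the latent direction
    return float(h @ direction)

# Synthetic check: two activation clusters separated along a known axis
rng = np.random.default_rng(1)
axis = np.array([1.0, 0.0, 0.0, 0.0])
enc = rng.normal(size=(50, 4)) * 0.1 + 2.0 * axis
dis = rng.normal(size=(50, 4)) * 0.1 - 2.0 * axis
d = latent_direction(enc, dis)
```

On such data, encourage-prompt activations project positively onto the recovered direction and discourage-prompt activations project negatively, which is the signature the study reads as a linear 'thinking budget' axis.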
Following this, the study traced activations along the model’s forward pass, focusing on the emergence of the “Wait” token as a reflection marker. Analysis of later intermediate layers revealed a distinct shift in probability mass towards discourse cues, specifically turning-point tokens such as “but” or “however”, and summarization tokens like “so” or “therefore”.
These layers, designated semantic-pivot layers, demonstrate a critical role in processing and regulating discourse-level information. Subsequently, the final layers, or behavior-overt layers, exhibited an increasing likelihood of sampling reflection-behavior tokens, culminating in the overt expression of self-reflection.
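The cue competition at the semantic-pivot layers can be quantified by summing logit-lens probability mass over each cue family. A sketch assuming the cue tokens' vocabulary ids are known (the ids and distributions below are invented for illustration):

```python
import numpy as np

# Hypothetical vocabulary ids for the two discourse-cue families
TURNING_IDS = [3, 7]   # e.g. "but", "however"
SUMMARY_IDS = [5, 9]   # e.g. "so", "therefore"

def cue_mass(probs, cue_ids):
    # Total probability mass a layer assigns to one cue family
    return float(np.asarray(probs)[cue_ids].sum())

def dominant_cue(probs):
    # Which discourse-cue family wins the competition at this layer
    t = cue_mass(probs, TURNING_IDS)
    s = cue_mass(probs, SUMMARY_IDS)
    return "turning-point" if t > s else "summarization"

# Toy layer-wise distributions over a 10-token vocabulary
layer_a = np.full(10, 0.02); layer_a[[3, 7]] = [0.50, 0.34]  # turning-dominant
layer_b = np.full(10, 0.02); layer_b[[5, 9]] = [0.40, 0.44]  # summary-dominant
```

Tracking which family dominates, layer by layer, is what reveals the shift in probability mass the study attributes to the semantic-pivot stage.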
To validate the proposed stage-wise organization, targeted interventions were implemented to establish a depth-wise causal chain. These interventions included injecting explicit semantics at the prompt level and applying activation steering within the latent-control layers. For example, discouraging reflection at the prompt level shifted activations towards a quick-thinking direction in the latent-control layers.
This perturbation propagated to the semantic-pivot layers, decreasing the probability of turning-point tokens and increasing that of summarization tokens, and ultimately reduced the sampling likelihood of reflection markers in the behavior-overt layers. Fine-grained activation-steering experiments corroborated these findings, supporting a coherent causal chain from latent monitoring through discourse regulation to overt self-reflection.
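In its simplest form, activation steering adds a scaled copy of the latent-control direction to the hidden state during the forward pass; a negative coefficient pushes toward the quick-thinking pole. A minimal sketch, with the coefficient and dimensions chosen only for illustration:

```python
import numpy as np

def steer(h, direction, alpha):
    """Shift a hidden state along the latent-control direction.

    alpha > 0 pushes toward 'detailed thinking' (more reflection);
    alpha < 0 pushes toward 'quick thinking' (fewer reflection markers).
    """
    direction = direction / np.linalg.norm(direction)
    return h + alpha * direction

# For a unit direction, steering shifts the projection by exactly alpha
rng = np.random.default_rng(2)
d = rng.normal(size=8); d /= np.linalg.norm(d)
h = rng.normal(size=8)
h_quick = steer(h, d, alpha=-3.0)
```

Because the direction is normalised, the intervention changes the projection onto the latent axis by exactly `alpha` while leaving the orthogonal components of the activation untouched.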
Latent Control and Semantic Pivots Define Reflective Processing in Large Language Models, enabling more nuanced and adaptable responses
Researchers identified a structured, depth-wise progression underlying reflective behaviour in R1-style large language models, beginning with latent-control layers where thinking-budget semantics are encoded along approximate linear directions. Analysis of the DeepSeek-R1 7B and Qwen3-Think 4B models, utilising a logit lens, revealed that these latent-control layers exhibit discernible separation between contrastive prompt pairs around Layer 8 and Layer 11 respectively, a separation that persists through to the final layer.
This initial divergence suggests that semantics related to thought processes are organised along a linear trajectory within the model’s representation space. Moving deeper, semantic-pivot layers demonstrate dominance of discourse-level cues, including turning-point and summarisation signals, within the probability mass.
Specifically, layer-wise activation-difference decoding highlighted that around Layer 15 of DeepSeek-R1 7B and Layer 22 of Qwen3-Think 4B, decoded tokens strongly indicated thinking-budget semantics. Probing these layers with contrasting prompts showed that positive activation differences surfaced cues associated with detailed thinking, while negative differences yielded semantics related to conciseness.
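The reported onset layers correspond to the first depth at which projections of the paired prompts pull apart and stay apart. One way to locate such a layer is sketched below on synthetic per-layer projections; the margin threshold and data are invented, not the paper's method in detail:

```python
import numpy as np

def first_separating_layer(proj_enc, proj_dis, margin=1.0):
    """First layer index at which the mean projections of the
    encourage/discourage prompt sets differ by more than `margin`.

    proj_*: (n_layers, n_prompts) projections onto the latent direction.
    Returns None if no layer separates.
    """
    gaps = proj_enc.mean(axis=1) - proj_dis.mean(axis=1)
    for layer, gap in enumerate(gaps):
        if gap > margin:
            return layer
    return None

# Synthetic trajectories: separation emerges at layer 8 and persists
n_layers, n_prompts = 28, 50
rng = np.random.default_rng(3)
gap_profile = np.where(np.arange(n_layers) >= 8, 3.0, 0.0)
proj_enc = rng.normal(size=(n_layers, n_prompts)) * 0.1 + gap_profile[:, None]
proj_dis = rng.normal(size=(n_layers, n_prompts)) * 0.1
```

The persistence of the gap through to the final layer, not just its onset, is what supports reading the direction as a stable linear axis rather than transient noise.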
Finally, behavior-overt layers showed a rising likelihood of reflection-behavior tokens, which ultimately became highly probable to sample. The study employed data from the GSM8K benchmark, sampling 200 problems and generating 200 matched prompt pairs with either reflection-encouraging or reflection-suppressing suffixes.
A total of 535 reflection-onset samples were collected for DeepSeek-R1 7B and 1,553 for Qwen3-Think 4B, aligning the analysis to the emission of the “Wait” marker token. Targeted interventions, including prompt-level semantic manipulation and activation steering, confirmed a causal chain across these stages, modulating the sampling likelihood of reflection markers.
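Building the matched prompt pairs and aligning each sample to the reflection marker can be sketched as follows; the suffix wordings are hypothetical stand-ins, not quoted from the paper:

```python
def matched_pairs(problems, encourage_suffix, suppress_suffix):
    # One (encourage, suppress) prompt pair per benchmark problem
    return [(f"{p} {encourage_suffix}", f"{p} {suppress_suffix}")
            for p in problems]

def reflection_onset(tokens, marker="Wait"):
    # Index of the first reflection marker in a generated token list,
    # or None if the model never emits one
    return tokens.index(marker) if marker in tokens else None

# Hypothetical suffixes and a toy generated sequence
pairs = matched_pairs(
    ["Q: 2 + 3 = ?"],
    encourage_suffix="Think carefully and double-check your reasoning.",
    suppress_suffix="Answer as briefly as possible.",
)
onset = reflection_onset(["2", "+", "3", "is", "5", ".", "Wait", ","])
```

Anchoring every trace to the marker's emission point is what allows activation trajectories from different generations to be averaged at comparable positions.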
Layered activation dynamics underpin emergent reflective capacity in neural networks
Researchers have identified a structured progression of internal signals within large language models that supports reflective behaviour. This progression unfolds across layers of the model, beginning with latent-control layers where semantic information relating to a ‘thinking budget’ is encoded. This initial stage transitions into semantic-pivot layers, characterised by discourse-level cues such as turning points and summarisation, which dominate the probability of token selection.
Finally, behavior-overt layers witness a rise in the likelihood of tokens associated with reflective behaviour, culminating in their frequent sampling. Interventions targeting these stages demonstrate a causal chain, where prompt semantics influence activation projections in latent-control layers, subsequently affecting competition between discourse cues in semantic-pivot layers and ultimately regulating the sampling of reflection-related tokens in behavior-overt layers.
This suggests a meta-cognitive process mirroring human thought, progressing from internal monitoring to discourse regulation and culminating in overt self-reflection. Analysis across different domains, including medical question answering, confirms this stage-wise pattern and indicates its generalisability.
The authors acknowledge that their work is primarily analytical and does not address deployment-facing systems, limiting immediate societal impacts. Future research could focus on regulating reflection through interventions in latent-control layers, forecasting reflection by modelling activation shifts, and enhancing reflection by injecting supervision into intermediate layers, offering a principled basis for predicting, controlling, and improving reflective behaviours in these models.
👉 More information
🗞 From Latent Signals to Reflection Behavior: Tracing Meta-Cognitive Activation Trajectory in R1-Style LLMs
🧠 arXiv: https://arxiv.org/abs/2602.01999
