Large Language Models Use Depth via ‘Guess-then-Refine’, Revising Over 70% of Early Predictions

Large language models (LLMs) are increasingly powerful, yet a detailed understanding of how they arrive at predictions remains elusive. Akshat Gupta, Jay Yeung, and Gopala Anumanchipalli, all from the University of California, Berkeley, alongside Anna Ivanova from Georgia Institute of Technology, now reveal a structured pattern in LLM computations, demonstrating that these models do not utilize their full depth uniformly. Their research traces the internal workings of LLMs during inference, proposing a “Guess-then-Refine” framework that explains how initial, statistically driven predictions evolve into contextually appropriate responses. The team’s analysis, encompassing part-of-speech tagging, fact recall, and multiple-choice question answering, shows that LLMs refine predictions across layers, revising even frequently occurring early guesses more than 70% of the time, and provides crucial insights into optimizing computational efficiency in these complex systems.

Decoding LLM Representations with TunedLens Probing

This research investigates how information is captured within the layers of large language models (LLMs). The team employed a technique called TunedLens to decode hidden representations, revealing what each layer learns during processing. The core of the method involves training a probe to predict the next word in a sequence based on the information available at each intermediate layer. Researchers compared TunedLens to other probing methods, finding it more accurate, particularly in earlier layers. The results show that early layers prioritize predicting high-frequency words, such as “the,” “a,” and “of.”

To ensure this wasn’t an artifact of the probing technique, the team conducted further experiments, modifying TunedLens to reduce its focus on these common words, yet the early layers still strongly favored high-frequency tokens, demonstrating genuine encoding of this information. Probability analysis confirmed that TunedLens accurately reflects the information content in the early layers, unlike other methods that underestimate the importance of common words. This research demonstrates that early layers of LLMs capture basic linguistic information, specifically word frequency, and TunedLens provides a reliable method for understanding how LLMs process information.
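To make the probing setup concrete, the sketch below illustrates the idea in the spirit of a tuned lens rather than reproducing the actual tuned-lens library API. Plain GPT-2 stands in for the paper’s larger models, the per-layer affine “translators” are left identity-initialized (the real TunedLens trains them to match the final-layer distribution), and the helper name layerwise_top_tokens is our own.

```python
# Illustrative TunedLens-style probe (assumptions: plain GPT-2 stands in for
# the paper's models; translators are identity-initialized here, whereas
# TunedLens trains them to match the final-layer output distribution).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

d = model.config.hidden_size
# One affine "translator" per transformer layer.
translators = [torch.nn.Linear(d, d) for _ in range(model.config.num_hidden_layers)]
for t in translators:
    torch.nn.init.eye_(t.weight)
    torch.nn.init.zeros_(t.bias)

@torch.no_grad()
def layerwise_top_tokens(prompt: str):
    """Return each layer's top-1 next-token guess for the prompt."""
    ids = tok(prompt, return_tensors="pt").input_ids
    # hidden_states[0] is the embedding output; [i + 1] is layer i's output.
    hidden = model(ids, output_hidden_states=True).hidden_states
    guesses = []
    for i, t in enumerate(translators):
        h = t(hidden[i + 1][0, -1])                        # translate
        logits = model.lm_head(model.transformer.ln_f(h))  # unembed
        guesses.append(tok.decode(logits.argmax().item()))
    return guesses

print(layerwise_top_tokens("The capital of France is"))
```

Decoding every layer through the model’s own final layer norm and unembedding is what lets the probe ask, at each depth, “if the model had to answer now, what would it say?”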

Layer Dynamics Reveal Guess-then-Refine Prediction

This study provides a detailed investigation into how large language models (LLMs) utilize their internal layers during prediction, revealing a “Guess-then-Refine” framework. Researchers traced intermediate representations within several open-weight LLMs, including GPT2-XL, Pythia-6.9B, Llama2-7B, and Llama3-8B, during inference to understand their layer-wise prediction dynamics. They used the TunedLens probe to decode these intermediate representations and analyze token predictions at each layer. The team categorized vocabulary tokens by their frequency within a large corpus, the English Wikipedia, dividing them into four groups.
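As a rough illustration of that binning step, the toy sketch below (our own helper frequency_bins, with a short word list standing in for the tokenized English Wikipedia corpus) ranks tokens by corpus frequency and splits the ranked vocabulary into four groups, which is enough to label any layer’s top guess with a frequency group.

```python
# Toy sketch of the frequency grouping (assumption: a short word list stands
# in for the tokenized English Wikipedia corpus used in the paper).
from collections import Counter

def frequency_bins(tokens, n_bins=4):
    """Map each token to a group: 0 = most frequent quartile of the ranked
    vocabulary, n_bins - 1 = rarest."""
    ranked = [t for t, _ in Counter(tokens).most_common()]
    per_bin = max(1, len(ranked) // n_bins)
    return {t: min(rank // per_bin, n_bins - 1) for rank, t in enumerate(ranked)}

corpus = "the cat sat on the mat and the dog sat on the rug".split()
bins = frequency_bins(corpus)

# Label one position's per-layer top guesses with their frequency groups;
# early layers should skew toward group 0 (high-frequency tokens).
layer_guesses = ["the", "the", "cat", "rug"]   # toy per-layer trace
print([bins.get(g, 3) for g in layer_guesses])  # -> [0, 0, 1, 3]
```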

This categorization allowed them to track how prediction patterns shifted across different frequency ranges. By providing prefixes from the English Wikipedia and recording the top-ranked token at each layer, researchers observed how initial predictions evolved. The results demonstrate that early layers heavily favor high-frequency tokens, proposing them as initial guesses due to limited contextual information. Over 75% of top-ranked tokens for Pythia-6.9B at the first layer belonged to the most frequent group, a pattern consistently observed across all tested models.

However, this reliance on frequency diminishes with depth, as deeper layers increasingly replace these guesses with rarer, contextually appropriate tokens. Crucially, the team quantified this refinement, finding that even high-frequency token predictions from early layers are refined over 70% of the time, demonstrating that correct prediction is not a “one-and-done” process. This detailed analysis provides a novel understanding of how LLMs leverage both statistical probabilities and contextual understanding to generate coherent text.
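One simple way to quantify such refinement, given per-position traces of each layer’s top guess, is to count how often an early layer’s guess differs from the final prediction. The sketch below uses hand-written toy traces in place of real probe output, and the helper refinement_rate is our own illustrative name.

```python
# Sketch of the refinement measurement (assumption: each trace lists one
# position's top-1 guess at every layer, as decoded by a probe).
def refinement_rate(traces, early_layer=0):
    """Fraction of positions whose early-layer guess is later replaced
    by a different final-layer prediction."""
    changed = sum(1 for t in traces if t[early_layer] != t[-1])
    return changed / len(traces)

traces = [
    ["the", "the", "a", "Paris"],   # refined: "the" -> "Paris"
    ["the", "of", "of", "of"],      # refined: "the" -> "of"
    ["and", "and", "and", "and"],   # kept from the first layer
]
print(refinement_rate(traces))      # -> 0.666... (2 of 3 refined)
```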

LLMs Employ Guess-then-Refine Prediction Framework

This work details a comprehensive investigation into how large language models (LLMs) utilize their internal layers during prediction, revealing a “Guess-then-Refine” framework that governs their computational process. Researchers leveraged the TunedLens probe to analyze intermediate layer representations, quantifying token prediction patterns across several open-weight models including GPT2-XL, Pythia-6.9B, Llama2-7B, and Llama3-8B. The study demonstrates that LLMs initially propose high-frequency tokens as potential predictions in early layers, subsequently refining these guesses into contextually appropriate tokens in deeper layers.

Experiments revealed that the top-ranked predictions in early layers are largely composed of high-frequency tokens, accounting for over 75% of initial proposals for Pythia-6.9B. Importantly, this initial “guessing” is not static; over 80% of these early-layer predictions are refined into contextually accurate generations by the final layer, demonstrating a dynamic refinement process. Further analysis uncovered that LLMs dynamically adjust their computational depth based on task complexity. In part-of-speech analysis, function words are predicted earliest, while more complex tokens like nouns and verbs require deeper processing.
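A depth-of-emergence measurement of this kind can be sketched as follows, with hand-written part-of-speech tags and per-layer guesses standing in for real tagger and probe outputs (the helper emergence_layer is our own): for each token, find the earliest layer from which the final prediction stays on top, then average per tag.

```python
# Sketch of the depth-of-emergence analysis (assumption: toy POS tags and
# per-layer guesses replace real tagger and probe outputs).
from statistics import mean

def emergence_layer(trace):
    """Earliest layer from which the final prediction stays top-ranked."""
    final = trace[-1]
    for layer in range(len(trace)):
        if all(g == final for g in trace[layer:]):
            return layer
    return len(trace) - 1

records = [
    ("DET",  ["the", "the", "the", "the"]),  # function word: immediate
    ("NOUN", ["the", "a", "cat", "cat"]),    # content word: emerges later
    ("VERB", ["the", "of", "ran", "ran"]),
]
by_pos = {}
for pos, trace in records:
    by_pos.setdefault(pos, []).append(emergence_layer(trace))
print({pos: mean(layers) for pos, layers in by_pos.items()})
# -> {'DET': 0, 'NOUN': 2, 'VERB': 2}: function words surface earliest
```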

When recalling multi-token facts, the first token of the answer demands the most computational depth, while subsequent tokens appear at shallower depths. For multiple-choice questions, the model identifies valid options within the first half of the layers, finalizing its response only towards the end of the process. These findings demonstrate “Complexity-Aware Depth Use,” where easier tasks require fewer layers of computation, and more complex predictions necessitate deeper processing.
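The multiple-choice observation suggests an equally simple measurement, sketched below with a hand-made trace and our own helper name first_layer: record the first layer whose top guess is any valid option letter (the format locking in) versus the first layer whose top guess already equals the final answer.

```python
# Sketch of the multiple-choice timing analysis (assumption: a toy per-layer
# trace of top-1 guesses for the answer position, with options A-D).
def first_layer(trace, accept):
    """Earliest layer whose top guess satisfies `accept`, else None."""
    return next((i for i, g in enumerate(trace) if accept(g)), None)

options = {"A", "B", "C", "D"}
trace = ["the", "the", "A", "A", "C", "B", "B", "B"]  # 8 toy layers

format_layer = first_layer(trace, lambda g: g in options)
answer_layer = first_layer(trace, lambda g: g == trace[-1])
print(format_layer, answer_layer)  # -> 2 5: format locks in before answer
```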

Iterative Refinement Drives Language Model Prediction

This research demonstrates that large language models do not utilize their full depth uniformly, instead employing a “guess-then-refine” strategy during prediction. By tracing intermediate representations during inference, scientists revealed that early layers initially propose predictions based on high-frequency tokens, effectively making statistical guesses due to limited contextual information. Subsequent layers then refine these initial proposals, with over 70% of tokens undergoing refinement as more contextual information becomes available. This process indicates that accurate token prediction is not achieved immediately, but rather through iterative improvement across layers.

Further analysis across part-of-speech tagging, fact recall, and multiple-choice tasks revealed nuanced depth usage. Function words are predicted earliest, the first token in a multi-token answer requires more computational depth than subsequent tokens, and the format of a multiple-choice response is identified early, with the final answer determined later. These findings provide a detailed view of how depth is intelligently utilized within language models, shedding light on the layer-by-layer computations that drive successful predictions. The authors acknowledge that their results rely on the TunedLens probe, and they conducted tests to demonstrate that the observed dominance of high-frequency tokens in early layers reflects the model’s internal representations, rather than a bias introduced by the probe itself. Future work could build on this understanding to improve computational efficiency in transformer-based models.

👉 More information
🗞 How Do LLMs Use Their Depth?
🧠 arXiv: https://arxiv.org/abs/2510.18871

Rohail T.


I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
