Gemma 2 Models Reveal Dichotomy in Representational Straightening Within Context

Researchers are increasingly focused on understanding how Large Language Models (LLMs) process information internally, and a new study led by Eghbal A. Hosseini, Yuxuan Li, and Yasaman Bahri from Google DeepMind, alongside Declan Campbell from Princeton Neuroscience Institute and Andrew Kyle Lampinen from Google DeepMind, sheds light on the representational geometry within these models during in-context learning. The team investigated whether LLMs organise representations into straighter neural trajectories as context increases, a phenomenon thought to facilitate next-token prediction. Their findings reveal a surprising dichotomy: while representations do become straighter with more context in continual prediction tasks, leading to better performance, this ‘straightening’ is inconsistent in structured prediction tasks, occurring only when explicit patterns are present. This suggests in-context learning isn’t a single process, but rather a dynamic adaptation in which LLMs select strategies based on task structure, behaving more like a versatile tool than a uniform system.

Representational Straightening Dynamics Reveal Differences in In-Context Learning Across Models

Scientists have demonstrated that large language models (LLMs) organise input sequence representations into straighter neural trajectories within their deep layers, a process theorised to facilitate next-token prediction through linear extrapolation. Building on this, the team investigated whether such representational straightening also emerges within a single context during in-context learning (ICL).
Researchers measured representational straightening in Gemma 2 models across a diverse set of ICL tasks, revealing a dichotomy in how LLM representations change. Experiments show that in continual prediction settings, such as natural language and grid world traversal tasks, increasing context leads to increased straightness of neural sequence trajectories, correlating with improved model prediction performance.
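
The summary does not state the exact metric, but a standard way to quantify trajectory straightness in prior work on representational straightening is the mean turning angle between consecutive difference vectors of the per-token hidden states. The sketch below is an illustrative implementation of that idea, not necessarily the authors’ exact measure.

```python
import numpy as np

def mean_curvature(hidden_states: np.ndarray) -> float:
    """Average turning angle (radians) along a trajectory of hidden states.

    hidden_states: array of shape (T, d), one d-dimensional vector per token.
    Lower values mean a straighter trajectory; a perfectly straight line
    through representation space would give 0.
    """
    # Difference vectors between consecutive token representations.
    diffs = np.diff(hidden_states, axis=0)                       # (T-1, d)
    diffs = diffs / np.linalg.norm(diffs, axis=1, keepdims=True)
    # Angle between each pair of consecutive (unit) difference vectors.
    cosines = np.clip(np.sum(diffs[:-1] * diffs[1:], axis=1), -1.0, 1.0)
    return float(np.mean(np.arccos(cosines)))

# Example: a near-linear trajectory has low curvature, a random walk does not.
t = np.linspace(0.0, 1.0, 50)[:, None]
straight = t * np.ones((1, 16)) + 0.01 * np.random.randn(50, 16)
meander = np.cumsum(np.random.randn(50, 16), axis=0)
print(mean_curvature(straight), mean_curvature(meander))
```

“Increased straightening with context” then corresponds to this average angle decreasing as more preceding tokens are included.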

Conversely, in structured prediction settings like few-shot tasks, straightening is inconsistent, appearing only during phases with explicit structure, for example, repeating a template, but disappearing otherwise. This work suggests that ICL is not a single, monolithic process, but rather a dynamic selection of strategies.

The study suggests that LLMs function like a Swiss Army knife, dynamically selecting between strategies depending on task structure, with only some strategies yielding representational straightening. Researchers evaluated a spectrum of data structures, including natural language, grid worlds with controlled latent structure, and few-shot learning benchmarks, to investigate how different types of context influence representational geometry.

Analysis reveals that representational geometry is task-dependent, with consistent straightening observed in natural language and structured grid tasks, indicating a flattening of representational manifolds. However, a dissociation was found in few-shot learning and question-answering tasks, where changes in straightening did not correlate with task performance.

This suggests LLMs do not rely on a universal mechanism for all forms of context, but instead employ a toolkit of distinct mechanisms appropriate to the task at hand. The research interrogates the evolution of representational structure during ICL, building upon previous work modelling sequence encoding as a neural trajectory within representation space, and drawing parallels to similar compression phenomena observed in biological systems.

Measuring Representational Straightening Across Diverse In-Context Learning Tasks

Scientists investigated how large language models (LLMs) organise information during in-context learning (ICL). The research team employed Gemma 2 models and examined representational straightening, a phenomenon where neural trajectories become more linear, within the context window during ICL. They designed experiments across three task classes: natural language with long-range dependencies, grid world traversal for latent structure inference, and few-shot learning with semantic and algorithmic reasoning.

To probe long-range dependencies, the study utilised the LAMBADA dataset, focusing on narrative passages where predicting the final word requires broad contextual understanding. Researchers generated sequences up to 1024 tokens long, assessing how context shapes representational geometry for next-token prediction.
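
As an illustration of how per-token hidden states could be extracted for such an analysis, the sketch below uses the Hugging Face transformers API; the checkpoint name, layer index, and truncation lengths are assumptions for illustration rather than the paper’s exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "google/gemma-2-2b"  # assumption: any Gemma 2 checkpoint could be used
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def layer_states(text: str, max_tokens: int, layer: int = 12) -> torch.Tensor:
    """Return per-token hidden states (seq_len, d_model) from one layer,
    with the input truncated to at most max_tokens tokens."""
    ids = tokenizer(text, return_tensors="pt").input_ids[:, :max_tokens]
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    return out.hidden_states[layer][0]   # drop the batch dimension

# Compare the geometry of the same passage under a short and a long context.
passage = "..."  # a LAMBADA-style narrative passage goes here
short_states = layer_states(passage, max_tokens=64)
long_states = layer_states(passage, max_tokens=1024)
```

Curvature of `short_states` versus `long_states` (for example with a function like `mean_curvature` above) is the kind of comparison the study describes.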

Grid world tasks were constructed at two levels of abstraction: a direct mapping with 36 nodes, each assigned a unique English word, and a hierarchical mapping with 16 latent nodes, each associated with four semantically similar “child” observation words. Sequences for the grid world tasks reached lengths of up to 2048 tokens, testing the model’s ability to infer underlying graph structures.
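
The generation procedure is not spelled out in this summary, but a minimal sketch of the kind of data structure described, a grid graph whose nodes emit words either directly or through latent parents with four child words each, might look as follows; the placeholder vocabulary and the uniform random walk are assumptions.

```python
import random

def neighbours(node: int, side: int) -> list[int]:
    """4-connected neighbours of a node in a side x side grid graph."""
    r, c = divmod(node, side)
    candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [nr * side + nc for nr, nc in candidates
            if 0 <= nr < side and 0 <= nc < side]

def random_walk(length: int, side: int, words_per_node: int) -> list[str]:
    """Emit a word sequence from a uniform random walk over a side x side grid.

    side=6, words_per_node=1 -> 36 nodes, one unique word each (direct mapping).
    side=4, words_per_node=4 -> 16 latent nodes, four child words each
    (hierarchical mapping). The "wordN_k" vocabulary is a placeholder,
    not the paper's word list.
    """
    node = random.randrange(side * side)
    tokens = []
    for _ in range(length):
        tokens.append(f"word{node}_{random.randrange(words_per_node)}")
        node = random.choice(neighbours(node, side))
    return tokens

print(" ".join(random_walk(20, side=6, words_per_node=1)))   # direct grid world
print(" ".join(random_walk(20, side=4, words_per_node=4)))   # hierarchical grid world
```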

Evaluation involved constructing a test set of 5-token walks inserted into the context window under three conditions: short context (positions 5-64), long context (final 64 tokens), and a 0-shot context condition for the hierarchical grid world, excluding specific transitions to force latent structure inference. For each task and condition, 200 unique sequences were generated.
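
One way to realise the probe conditions described above is to place the same 5-token walk either early in the context (within positions 5-64) or within the final 64 tokens, and then compare trajectory curvature at those positions. The sketch below covers only the sequence construction and is an assumption about the details; the 0-shot transition filtering is omitted.

```python
import random

def insert_probe(context: list[str], probe: list[str], condition: str) -> list[str]:
    """Place a 5-token probe walk inside a context sequence.

    "short": the probe sits early in the context (within positions 5-64).
    "long":  the probe sits within the final 64 tokens of the context.
    The 0-shot condition (withholding specific transitions from the context)
    is not sketched here.
    """
    assert len(probe) == 5
    if condition == "short":
        start = random.randint(5, 64 - len(probe))
    elif condition == "long":
        start = random.randint(len(context) - 64, len(context) - len(probe))
    else:
        raise ValueError(f"unknown condition: {condition}")
    return context[:start] + probe + context[start + len(probe):]

context = [f"tok{i}" for i in range(2048)]   # placeholder context walk
probe = [f"probe{i}" for i in range(5)]      # placeholder 5-token test walk
short_seq = insert_probe(context, probe, "short")
long_seq = insert_probe(context, probe, "long")
print(len(short_seq), len(long_seq))          # both remain 2048 tokens long
```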

Furthermore, the team selected 100 eight-shot examples from existing datasets for the few-shot learning tasks, formatted them as question-answer pairs, and extracted model representations together with task performance for analysis. This multifaceted approach enabled the researchers to uncover a dichotomy in how LLMs’ representations change in context, revealing that ICL is not a monolithic process.
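
A minimal sketch of how eight-shot question-answer prompts might be assembled before extracting representations; the Q:/A: template and field names are illustrative assumptions, not the paper’s exact formatting.

```python
def format_k_shot(examples: list[dict], query: dict, k: int = 8) -> str:
    """Build a k-shot prompt from question-answer pairs.

    examples: list of {"question": ..., "answer": ...} dicts from a dataset.
    The Q:/A: template is an illustrative assumption.
    """
    shots = [f"Q: {ex['question']}\nA: {ex['answer']}" for ex in examples[:k]]
    shots.append(f"Q: {query['question']}\nA:")
    return "\n\n".join(shots)

demo = [{"question": f"example question {i}?", "answer": f"answer {i}"} for i in range(8)]
print(format_k_shot(demo, {"question": "held-out question?"}))
```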

Representational dynamics differentiate continual and structured in-context learning, revealing distinct mechanisms for adaptation

Scientists investigated how large language models (LLMs) reorganise their internal representations during in-context learning (ICL). The research team measured representational straightening in Gemma 2 models across a diverse set of tasks, revealing a dichotomy in how LLM representations change with context.

In continual prediction settings, such as natural language and grid world traversal, increasing context demonstrably increased the straightness of neural sequence trajectories. This increase in straightness correlated with improvements in the model’s predictive performance, suggesting a more efficient encoding of information.
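
To illustrate how such an association could be tested, the sketch below correlates a per-sequence drop in curvature with a drop in prediction loss using a Pearson correlation; the arrays here are synthetic placeholders, and the choice of statistic is an assumption rather than the authors’ reported analysis.

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic placeholders, one entry per evaluated sequence (not the paper's data):
# how much curvature dropped, and how much next-token loss dropped, when the
# context was extended from short to long.
rng = np.random.default_rng(0)
curvature_drop = rng.random(200)
loss_drop = 0.5 * curvature_drop + 0.1 * rng.standard_normal(200)

r, p = pearsonr(curvature_drop, loss_drop)
print(f"Pearson r = {r:.3f}, p = {p:.3g}")
```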

Experiments revealed that in structured prediction settings, like few-shot tasks, representational straightening was inconsistent. Straightening was present only during phases of the task exhibiting explicit structure, such as repeating a template, and disappeared when that structure was absent. This again suggests that ICL is not a single, uniform process, but rather a dynamic selection of strategies.

Researchers observed that the degree of straightening in neural trajectories varied significantly depending on the task’s inherent structure. In natural language tasks utilising the LAMBADA dataset, increased context consistently led to straighter trajectories within the model’s middle layers.

Synthetic grid world tasks also exhibited this pattern, with longer contexts promoting representational straightening and improved performance. Conversely, few-shot learning and question-answering tasks did not consistently show this effect. The absence of consistent straightening in these tasks suggests that LLMs employ a flexible toolkit of computational mechanisms; the authors propose that LLMs function like a Swiss Army knife, dynamically selecting the most appropriate strategy based on task structure.

Findings Point to a Flexible Toolkit of Strategies, with Open Questions for Future Work

Scientists have demonstrated that large language models (LLMs) exhibit differing representational strategies during in-context learning (ICL), depending on the nature of the task. Researchers measured representational straightening, the tendency of neural trajectories to become more linear, within the context of various ICL tasks using Gemma 2 models.

Their work reveals a dichotomy in how LLMs process information; in continual prediction tasks, such as natural language processing and grid world navigation, increasing context leads to increased representational straightening, correlating with improved predictive performance. Conversely, in tasks with structured prediction requirements, representational straightening is inconsistent, appearing only during phases involving explicit structure like template repetition, but disappearing when structure is absent.

This suggests that ICL is not a uniform process, but rather LLMs employ a flexible “toolkit” of strategies, dynamically selecting the most appropriate approach based on task demands. The findings caution against seeking a single, generalized notion of representational structure underlying ICL, proposing instead that models maintain flexibility through a library of computational strategies reflected in diverse representational signatures.

The authors acknowledge limitations including a focus on specific geometrical measures and a lack of causal interventions to confirm the relationship between straightening and behaviour. Future research should explore other geometrical and topological features, and develop methods for manipulating these features without disrupting the model’s natural data manifold. Expanding the evaluation to a wider range of model architectures and scales is also recommended to refine these observations.

👉 More information
🗞 Context Structure Reshapes the Representational Geometry of Language Models
🧠 ArXiv: https://arxiv.org/abs/2601.22364

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
