Large language models increasingly demonstrate complex linguistic abilities, yet the precise stages at which these capabilities develop remain largely mysterious, as standard evaluation methods offer limited insight into the learning process. Deniz Bayazit from EPFL, Aaron Mueller from Boston University, and Antoine Bosselut from EPFL address this challenge by tracking the evolution of linguistic features throughout the pretraining of these models. Their work introduces a novel method using ‘sparse crosscoders’ to discover and align features across different stages of training, effectively revealing when specific concepts emerge, are maintained, or even disappear. The approach applies across model architectures and scales to billion-parameter models, offering a significant step towards understanding the inner workings of large language models and promising more interpretable analysis of representation learning.
Linguistic Features Emerge During Pre-training
This text details an analysis of the internal representations learned by the BLOOM language model at different stages of pre-training, from checkpoints trained on roughly 1 billion tokens up to roughly 341 billion tokens. The analysis focuses on identifying the most important features and understanding how these features overlap across different languages, including English, French, Hindi, and Arabic, and across linguistic tasks such as subject-verb and gender agreement. The key findings demonstrate how the model’s understanding of language evolves over time. The analysis uses a metric called Relative Indirect Effects, or RelIE, to identify important internal features relating to tense, gender, number, and grammatical markers.
Generally, feature overlap increases as pretraining progresses, indicating that the model gradually learns more shared representations. Languages with similar structures, such as English, French, Spanish, and Portuguese, exhibit higher feature overlap, suggesting the model leverages these similarities. Conversely, languages with different scripts and structures, like Arabic and Hindi, show less feature overlap. Arabic demonstrates better cross-lingual generalization than Hindi, potentially due to its greater presence in the training data, while Hindi features tend to remain language-specific. Early versions of the model learn more language-specific features, while later versions learn more shared, cross-lingual features, highlighting the emergence of shared representations as the model processes more data.
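To make the overlap measurement concrete, the sketch below shows one simple way such cross-lingual feature overlap could be computed, assuming per-language feature-importance scores over the shared crosscoder feature space (e.g., RelIE values) are already available. The language codes, scores, top-k cutoff, and the choice of Jaccard similarity are illustrative, not the paper’s exact procedure.

```python
# Hypothetical sketch: Jaccard overlap of the top-k most important crosscoder
# features per language. The per-language importance scores are assumed to be
# precomputed (e.g., RelIE values); the numbers below are placeholders.

def top_k_features(scores: dict[int, float], k: int = 100) -> set[int]:
    """Return the indices of the k highest-scoring features."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])

def jaccard_overlap(a: set[int], b: set[int]) -> float:
    """Jaccard similarity between two feature-index sets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

# Illustrative per-language importance scores over a shared feature space.
importance = {
    "en": {0: 0.9, 1: 0.7, 2: 0.1, 5: 0.4},
    "fr": {0: 0.8, 1: 0.6, 3: 0.3, 5: 0.5},
    "hi": {7: 0.9, 8: 0.7, 9: 0.2, 2: 0.1},
}

k = 3
feature_sets = {lang: top_k_features(s, k) for lang, s in importance.items()}
for a in importance:
    for b in importance:
        if a < b:
            print(a, b, round(jaccard_overlap(feature_sets[a], feature_sets[b]), 2))
```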
Tracking Concept Emergence in Language Models
Scientists have developed a novel methodology to track the emergence and consolidation of linguistic representations within large language models (LLMs) during pretraining. Recognizing that traditional benchmarking fails to reveal how models acquire concepts, the team pioneered the use of sparse crosscoders to discover and align features across different model checkpoints, effectively creating a timeline of learning. This approach moves beyond simply measuring performance to understanding the underlying mechanisms of concept internalization. The study employs sparse crosscoders, which learn a unified feature space across multiple checkpoints, allowing for direct comparison of representations at different stages of training.
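As a rough illustration of this idea, the following sketch shows what a sparse crosscoder over several checkpoints might look like: activations from each checkpoint are encoded into one shared feature dictionary and decoded back per checkpoint, with an L1 penalty encouraging sparsity. The dimensions, penalty weight, and loss form are placeholders and not the authors’ exact implementation.

```python
import torch
import torch.nn as nn

class CheckpointCrosscoder(nn.Module):
    """Minimal sketch of a sparse crosscoder across pretraining checkpoints:
    activations from each checkpoint are mapped into a single shared feature
    space and reconstructed per checkpoint."""

    def __init__(self, d_model: int, n_features: int, n_checkpoints: int):
        super().__init__()
        self.enc = nn.ModuleList(
            [nn.Linear(d_model, n_features, bias=False) for _ in range(n_checkpoints)]
        )
        self.enc_bias = nn.Parameter(torch.zeros(n_features))
        self.dec = nn.ModuleList(
            [nn.Linear(n_features, d_model) for _ in range(n_checkpoints)]
        )

    def encode(self, acts: list[torch.Tensor]) -> torch.Tensor:
        # acts[i]: (batch, d_model) mid-layer activations from checkpoint i.
        pre = sum(e(a) for e, a in zip(self.enc, acts)) + self.enc_bias
        return torch.relu(pre)                      # shared sparse features

    def forward(self, acts: list[torch.Tensor]):
        feats = self.encode(acts)
        recons = [d(feats) for d in self.dec]       # one reconstruction per checkpoint
        return feats, recons

def crosscoder_loss(acts, feats, recons, l1_coef: float = 1e-3):
    """Reconstruction error summed over checkpoints plus an L1 sparsity term
    (a simple variant; the paper's exact objective may differ)."""
    recon = sum(((a - r) ** 2).sum(-1).mean() for a, r in zip(acts, recons))
    return recon + l1_coef * feats.abs().sum(-1).mean()

# Illustrative usage on random placeholder activations from three checkpoints.
cc = CheckpointCrosscoder(d_model=512, n_features=4096, n_checkpoints=3)
acts = [torch.randn(8, 512) for _ in range(3)]
feats, recons = cc(acts)
crosscoder_loss(acts, feats, recons).backward()
```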
Unlike previous methods requiring unique sparse autoencoders for each checkpoint, this shared feature space enables researchers to pinpoint features that are maintained, emerge, or vanish over time. The team trained these crosscoders on triplets of checkpoints exhibiting significant behavioral and representational shifts, focusing on capturing the evolution of syntactic concepts. To quantify the causal importance of individual features throughout training, scientists introduced a novel metric called Relative Indirect Effects (RelIE). This metric attributes feature relevance across checkpoints, providing a precise measure of when and how features contribute to task performance.
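The snippet below sketches one plausible reading of a RelIE-style attribution: given per-feature indirect effects measured at each checkpoint (e.g., by patching or ablating a feature and recording the change in a task metric), each feature’s effect is normalized by its total effect across checkpoints, so the scores indicate when during training the feature matters. This is a hedged approximation of the idea, not the paper’s exact formula.

```python
import torch

def relative_indirect_effects(ie_per_checkpoint: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Sketch of a RelIE-style score.

    ie_per_checkpoint: (n_checkpoints, n_features) indirect effects, e.g. the
    change in a task metric when a feature is patched at that checkpoint.
    Returns scores that, for each feature, distribute its causal relevance
    across checkpoints.
    """
    total = ie_per_checkpoint.abs().sum(dim=0, keepdim=True)   # per-feature total effect
    return ie_per_checkpoint / (total + eps)

# Illustrative: 3 checkpoints, 5 features (made-up numbers).
ie = torch.tensor([[0.0, 0.2, 0.9, 0.0, 0.1],
                   [0.1, 0.5, 0.8, 0.0, 0.4],
                   [0.6, 0.5, 0.1, 0.0, 0.4]])
print(relative_indirect_effects(ie))
# A feature whose mass sits in early rows matters early in training;
# a feature with near-zero effect everywhere is never task-relevant.
```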
Through ablation and interpretability studies, researchers validated RELIE, confirming its ability to accurately trace the development of linguistic concepts. This combination of crosscoders and RELIE enables precise annotation of feature roles at individual checkpoints and their evolution over time. This architecture-agnostic framework, validated on Pythia, BLOOM, and OLMo, demonstrates scalability to billion-parameter models, offering a powerful tool for analyzing representation learning. Qualitative findings reveal that LLMs progressively build higher-level abstractions, transitioning from token and language-specific concepts to more universal representations, highlighting the dynamic nature of learning within these complex systems. The methodology provides a detailed, concept-level understanding of LLM pretraining, moving beyond performance metrics to reveal the underlying mechanisms of linguistic acquisition.
Tracing Linguistic Feature Development in LLMs
Researchers have developed a novel method to track the emergence of linguistic abilities within large language models (LLMs) during their pretraining phase. This approach utilizes sparse crosscoders, designed to discover and align features across different checkpoints of the model’s development, allowing scientists to trace how specific linguistic features evolve over time. The team successfully demonstrates that these crosscoders can detect when features are first learned, maintained throughout training, or ultimately discarded. The method involves comparing model checkpoints and quantifying changes in cross-entropy loss when reconstructing mid-layer outputs using the crosscoders, a metric denoted as ∆CE.
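A minimal sketch of how such a ∆CE measurement could be implemented is shown below, assuming a HuggingFace-style causal language model whose forward pass returns logits, and a reconstruct function that maps a layer’s activations to their crosscoder reconstruction for the checkpoint being probed. The hook mechanics are generic PyTorch and only approximate the authors’ pipeline.

```python
import torch
import torch.nn.functional as F

def delta_ce(model, layer_module, input_ids, reconstruct):
    """Increase in next-token cross-entropy when a mid-layer's output is
    replaced by its crosscoder reconstruction (a ∆CE-style metric)."""
    labels = input_ids[:, 1:]

    def ce(logits):
        return F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)), labels.reshape(-1)
        )

    with torch.no_grad():
        base = ce(model(input_ids).logits)          # unmodified forward pass

    def hook(_module, _inputs, output):
        # Replace the layer's hidden states with their reconstruction.
        hidden = output[0] if isinstance(output, tuple) else output
        patched = reconstruct(hidden)
        return (patched, *output[1:]) if isinstance(output, tuple) else patched

    handle = layer_module.register_forward_hook(hook)
    try:
        with torch.no_grad():
            patched_loss = ce(model(input_ids).logits)
    finally:
        handle.remove()

    return (patched_loss - base).item()

# Illustrative usage (paths and reconstruct are hypothetical):
# layer = model.transformer.h[12]   # some mid layer; the attribute path depends on the model
# gap = delta_ce(model, layer, input_ids, reconstruct=my_crosscoder_reconstruction)
```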
Results show that checkpoints early in training, such as those trained on about 1 billion tokens, exhibit smaller ∆CE values than checkpoints trained on far more data, such as about 286 billion tokens, reflecting the increasing complexity of the learned representations. Analysis of these crosscoders reveals quantifiable metrics, including the average number of activated features (l0) and the number of features never activated (dead features), providing insights into the model’s internal workings. Experiments across several model families, including Pythia, OLMo, and BLOOM, demonstrate the effectiveness of this approach. For instance, comparisons between checkpoints trained on roughly 128 million and 1 billion tokens show distinct changes in ∆CE, while comparisons involving checkpoints trained on up to roughly 3,048 billion tokens reveal increasingly complex feature dynamics.
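The two diagnostics mentioned above can be computed directly from a matrix of crosscoder feature activations, as in the illustrative snippet below; the input here is random placeholder data, not the paper’s evaluation set.

```python
import torch

def sparsity_stats(feature_acts: torch.Tensor):
    """Given (n_examples, n_features) crosscoder activations, return:
    - l0: average number of features active per example,
    - dead: number of features that never activate on this set."""
    active = feature_acts > 0
    l0 = active.sum(dim=-1).float().mean().item()      # avg active features per row
    dead = int((~active.any(dim=0)).sum().item())       # columns never active
    return l0, dead

# Illustrative: 6 tokens, 8 shared features of placeholder activations.
acts = torch.relu(torch.randn(6, 8))
print(sparsity_stats(acts))
```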
Furthermore, the team introduced a metric called Relative Indirect Effects (RelIE) to pinpoint features causally important for task performance, demonstrating strong agreement with feature ablation studies (average Spearman correlations of 0.945 and 0.952) and indicating that RelIE effectively identifies task-relevant features. This research offers a scalable and interpretable method for analyzing representation learning, paving the way for a more nuanced understanding of how LLMs acquire linguistic capabilities.
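That validation amounts to comparing two rankings of the same features, one by RelIE score and one by the damage caused when each feature is ablated. The toy example below shows the shape of that check with made-up numbers.

```python
import numpy as np
from scipy.stats import spearmanr

# Illustrative scores only: one value per feature from each method.
relie_scores = np.array([0.91, 0.05, 0.40, 0.72, 0.10])
ablation_effect = np.array([0.88, 0.03, 0.35, 0.80, 0.12])  # e.g., loss increase when ablated

rho, pval = spearmanr(relie_scores, ablation_effect)
print(f"Spearman rho = {rho:.3f} (p = {pval:.3g})")
# A rho near 1 means RelIE ranks features much like direct ablation does.
```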
Tracking Linguistic Abilities During Pre-training
This research demonstrates a method for tracking the development of linguistic abilities within large language models during their pre-training phase. By employing crosscoders to analyze model checkpoints, the team successfully detected when specific features, ranging from basic token recognition to complex syntactic patterns, emerge, are maintained, or disappear. The findings reveal that monolingual models progress from identifying individual tokens to understanding broader grammatical structures, while multilingual models consolidate these into shared, cross-lingual representations. This approach proves scalable and applicable across different model architectures, offering a pathway towards more interpretable analysis of how these models learn.
👉 More information
🗞 Crosscoding Through Time: Tracking Emergence & Consolidation Of Linguistic Representations Throughout LLM Pretraining
🧠 ArXiv: https://arxiv.org/abs/2509.05291
