Large language models store knowledge in ways that are difficult to inspect or manipulate, but researchers are developing methods to bring greater clarity to how that knowledge is organised. Minglai Yang, Xinyu Guo, and Mihai Surdeanu from the University of Arizona, together with Liangming Pan from Peking University, present a new technique called AlignSAE that tackles this challenge by creating a more organised internal representation of knowledge. AlignSAE uses sparse autoencoders, a type of machine learning model, and trains them in two stages: first learning general activation patterns, then explicitly linking those patterns to defined concepts. The result is a representation in which individual concepts are cleanly separated, allowing researchers to precisely control, and even swap, concepts within the model, opening new possibilities for understanding and steering artificial intelligence.
Sparse Autoencoders Reveal LLM Representations
This research investigates how large language models represent knowledge internally, aiming to make these complex systems more understandable and controllable. Scientists employed Sparse Autoencoders to simplify and reveal underlying patterns in the internal activations of a pre-trained language model, focusing on the residual stream to capture key information. This involved a carefully designed loss function that encouraged accurate reconstruction of activations, sparsity in the encoded representation, and alignment of latent dimensions with semantic concepts. The goal was to create a latent space where each dimension corresponds to a distinct concept, enabling controlled manipulation of the model’s behavior.
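A minimal PyTorch sketch of this setup is shown below. The architecture, slot layout, and loss weights are illustrative assumptions rather than the paper's exact configuration: the sparse autoencoder reads residual-stream activations from a frozen model, and the loss combines reconstruction, sparsity, and a simple alignment term that pulls each labelled example toward its reserved concept slot.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Minimal SAE over residual-stream activations of a frozen LLM."""
    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)
        self.decoder = nn.Linear(d_latent, d_model)

    def forward(self, x):
        z = F.relu(self.encoder(x))   # sparse, non-negative latent code
        x_hat = self.decoder(z)       # reconstruction of the activation
        return x_hat, z

def alignsae_loss(x, x_hat, z, concept_ids, n_concept_slots,
                  l1_weight=1e-3, align_weight=1.0):
    """Reconstruction + sparsity + concept-alignment objective.
    Weights and the slot layout are illustrative, not the paper's values."""
    recon = F.mse_loss(x_hat, x)                        # accurate reconstruction
    sparsity = z.abs().mean()                           # L1 penalty -> sparse codes
    # Alignment: the slot reserved for each example's labelled concept should
    # dominate the concept-aligned block of the code.
    align = F.cross_entropy(z[:, :n_concept_slots], concept_ids)
    return recon + l1_weight * sparsity + align_weight * align
```

In this layout the first `n_concept_slots` latent dimensions are reserved for named concepts, a convention that the supervised post-training stage described later builds on.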
Detailed analysis revealed that the benefits of this approach are most pronounced in the middle layers of the language model, where semantic concepts emerge most clearly. In these layers, the Sparse Autoencoder established a clear correspondence between ontological relations and latent slots, yielding more stable and interpretable features. Layer 6 was identified as the optimal layer for semantic intervention, balancing semantic representation against artifacts and enabling controlled manipulation of the model’s behavior with an 85% success rate. The aligned latent space thus improves the interpretability of the model’s internal representations while providing a handle for controlling its behavior. This work has potential implications for explainable AI, causal reasoning, AI safety, knowledge editing, and personalized AI, contributing to the development of more transparent and controllable artificial intelligence systems.
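As an illustration, such a slot-level intervention can be implemented as a forward hook on the chosen mid-layer. In the sketch below, only the choice of layer 6 follows the paper; the module path, slot indices, and hook logic are hypothetical and would need adapting to the actual model.

```python
import torch

@torch.no_grad()
def concept_swap(model, sae, input_ids, layer=6, src_slot=3, tgt_slot=7):
    """Hedged sketch of a slot-level intervention at a mid-layer residual stream.
    Slot indices and the module path are placeholders."""

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        _, z = sae(hidden)                        # sparse code per token position
        z_new = z.clone()
        z_new[..., tgt_slot] = z[..., src_slot]   # move the activation to the target concept
        z_new[..., src_slot] = 0.0                # suppress the original concept
        # Apply only the edit, preserving the SAE's reconstruction error.
        patched = hidden + sae.decoder(z_new) - sae.decoder(z)
        if isinstance(output, tuple):
            return (patched,) + output[1:]
        return patched

    block = model.transformer.h[layer]            # GPT-2-style path; adapt to the model used
    handle = block.register_forward_hook(hook)
    try:
        return model(input_ids)
    finally:
        handle.remove()
```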
Aligning Sparse Autoencoders with LLM Activations
Scientists developed a new method, AlignSAE, to address limitations in interpreting hidden activations within large language models. Recognizing that LLMs encode knowledge in complex spaces, researchers employed Sparse Autoencoders to decompose these activations into more interpretable features. However, standard Sparse Autoencoders often struggle to reliably align these features with human-defined concepts. To overcome this, the team engineered a “pre-train, then post-train” curriculum for AlignSAE, initially training the Sparse Autoencoder on language model activations to establish a foundational sparse representation.
Subsequently, a supervised post-training phase was implemented, binding specific concepts to dedicated latent slots within the Sparse Autoencoder through a novel loss function that encourages isolation of concepts. The researchers carefully preserved the remaining capacity of the Sparse Autoencoder for general reconstruction, maintaining the model’s overall functionality. This creates an interpretable interface where specific relations can be inspected and controlled without interference from unrelated features. Experiments demonstrate that AlignSAE enables precise causal interventions, such as reliable “concept swaps”, by targeting single, semantically aligned slots. The team validated the method by analyzing normalized activations, demonstrating that concepts are cleanly isolated within dedicated features, facilitating both interpretation and steering of the language model’s behavior.
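A hedged sketch of the two-phase curriculum, reusing the `SparseAutoencoder` from the earlier example, is shown below. Here the single alignment term is split into a binding loss on the designated slot and an isolation loss on the remaining concept slots, while slots beyond the concept block act as the free reconstruction bank. Loss names, weights, and the data loaders are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def pretrain_step(sae, x, l1=1e-3):
    """Phase 1: unsupervised SAE training on frozen-LLM activations."""
    x_hat, z = sae(x)
    return F.mse_loss(x_hat, x) + l1 * z.abs().mean()

def posttrain_step(sae, x, concept_ids, n_concept_slots,
                   l1=1e-3, w_bind=1.0, w_isolate=1.0):
    """Phase 2: bind each labelled concept to its dedicated slot and keep the
    other concept slots quiet; slots beyond n_concept_slots form the free bank
    and remain purely reconstructive."""
    x_hat, z = sae(x)
    recon = F.mse_loss(x_hat, x)
    sparsity = z.abs().mean()
    z_c = z[:, :n_concept_slots]                              # concept-aligned block
    onehot = F.one_hot(concept_ids, n_concept_slots).float()
    bind = ((1.0 - z_c) * onehot).pow(2).mean()               # assigned slot pulled toward 1
    isolate = (z_c * (1.0 - onehot)).pow(2).mean()            # other concept slots pushed toward 0
    return recon + l1 * sparsity + w_bind * bind + w_isolate * isolate

def train_alignsae(sae, act_loader, labelled_loader, n_concept_slots,
                   pretrain_epochs=5, posttrain_epochs=5, lr=1e-4):
    """Pre-train, then post-train; the base language model stays frozen throughout."""
    opt = torch.optim.Adam(sae.parameters(), lr=lr)
    for _ in range(pretrain_epochs):
        for x in act_loader:                                  # activations only
            loss = pretrain_step(sae, x)
            opt.zero_grad(); loss.backward(); opt.step()
    for _ in range(posttrain_epochs):
        for x, concept_ids in labelled_loader:                # (activation, concept label) pairs
            loss = posttrain_step(sae, x, concept_ids, n_concept_slots)
            opt.zero_grad(); loss.backward(); opt.step()
```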
Disentangling Concepts in Large Language Models
Scientists developed AlignSAE to address limitations in interpreting how large language models process information, focusing on disentangling complex concepts within the models’ hidden layers. The researchers built on existing Sparse Autoencoders and introduced a “pre-train, then post-train” curriculum. A large Sparse Autoencoder was trained alongside a frozen base language model, first in an unsupervised manner to learn a general code, then under concept supervision that designates specific latent slots for defined concepts while preserving a free feature bank for overall reconstruction. The training objective was augmented with losses that bind each concept to its designated slot and isolate it there, yielding clean, isolated activations that are easy to find, interpret, and steer. As before, targeting single, semantically aligned slots enables precise causal interventions such as reliable “concept swaps”. By mapping each concept to a dedicated feature, the method offers a pathway to applications requiring reliable feature-level control, including safety steering, knowledge editing, and data attribution.
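One simple way to verify this isolation is to average normalized slot activations per gold concept and check that the resulting concept-by-slot matrix is close to diagonal. The helper below is a hypothetical diagnostic in that spirit, not code from the paper; `acts_by_concept` is an assumed mapping from concept index to a batch of activations.

```python
import torch

@torch.no_grad()
def concept_slot_matrix(sae, acts_by_concept, n_concept_slots):
    """Mean normalized activation of each concept slot, grouped by the gold
    concept of the input. A near-diagonal matrix indicates one-to-one
    concept-slot alignment."""
    rows = []
    for c in sorted(acts_by_concept):
        _, z = sae(acts_by_concept[c])                        # (batch, d_latent)
        z_c = z[:, :n_concept_slots]                          # concept-aligned block
        z_norm = z_c / (z_c.sum(dim=-1, keepdim=True) + 1e-8) # normalize per example
        rows.append(z_norm.mean(dim=0))
    return torch.stack(rows)                                  # (n_concepts, n_concept_slots)
```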
Ontology Encoding Enables Controllable Language Models
This research presents a practical method for adding world knowledge to large language models by encoding ontological information into a frozen model’s mid-layer through a concept-aligned sparse autoencoder post-training interface. The team demonstrates that by training on verifiable reasoning traces, slot identity becomes linked to ontology relations, making answers predictable from those slots and creating an interface that is both explainable and controllable. Results across several small ontologies show robust relation binding even with paraphrased questions, reliable slot-level interventions that steer answers, and a clear emergence of one-to-one alignment in the middle layers of the model. This approach shifts the focus from analysing distributed representations to operating an addressable world-knowledge interface without altering the base language model’s weights, effectively bridging the gap between models lacking world knowledge and those with actionable structure. Named concept slots function as stable variables that could potentially control tools or inform decisions in agents and robots.
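In practice, such an addressable interface can be read directly: the most active concept slot names the relation the model is using for a given input. The small reader below is a hypothetical sketch of that idea; `slot_names` is an assumed list mapping slot indices to relation labels.

```python
import torch

@torch.no_grad()
def read_relation(sae, hidden, slot_names):
    """Predict the ontology relation from the aligned slots of mid-layer
    activations `hidden` with shape (batch, d_model)."""
    _, z = sae(hidden)
    z_c = z[:, :len(slot_names)]       # concept-aligned block
    slot = z_c.argmax(dim=-1)          # winning slot per example
    return [slot_names[i] for i in slot.tolist()]
```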
👉 More information
🗞 AlignSAE: Concept-Aligned Sparse Autoencoders
🧠 ArXiv: https://arxiv.org/abs/2512.02004
