Scientists are addressing the challenge of mechanistic interpretability in billion-parameter large language models (LLMs), aiming to understand how these models compute internally. Mohammed Mudassir Uddin, Shahnawaz Alam, and Mohammed Kaif Pasha, from Muffakham Jah College of Engineering and Technology, introduce Hierarchical Attribution Graph Decomposition (HAGD), a novel framework that efficiently extracts sparse computational circuits from LLMs. HAGD drastically reduces computational cost compared with exhaustive search over candidate circuits, enabling scalable analysis of models ranging from 117 million to 70 billion parameters.
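To make the idea concrete, here is a minimal, hypothetical sketch of attribution-guided hierarchical pruning: edges of the model's computational graph are grouped into coarse modules, the most attributed modules are kept first, and only the highest-scoring edges inside them survive, avoiding an exhaustive search over edge subsets. The Edge structure, scores, and budgets below are illustrative assumptions, not the authors' HAGD implementation.

```python
# Minimal sketch of hierarchical, attribution-guided circuit pruning.
# All names (Edge, module grouping, budgets) are illustrative assumptions.
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    src: str        # upstream component, e.g. "L3.head7"
    dst: str        # downstream component, e.g. "L5.mlp"
    module: str     # coarse group the edge belongs to, e.g. "layer3"
    score: float    # attribution of this edge to the target behaviour

def extract_circuit(edges, module_budget, edge_budget):
    """Two-level decomposition: keep the most attributed modules first,
    then the top-scoring edges inside them, instead of searching over
    every possible edge subset."""
    by_module = defaultdict(list)
    for e in edges:
        by_module[e.module].append(e)

    # Level 1: rank coarse modules by their total attribution mass.
    ranked = sorted(by_module,
                    key=lambda m: sum(e.score for e in by_module[m]),
                    reverse=True)[:module_budget]

    # Level 2: within surviving modules, keep only the strongest edges.
    circuit = []
    for m in ranked:
        circuit += sorted(by_module[m], key=lambda e: e.score,
                          reverse=True)[:edge_budget]
    return circuit
```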
Experiments show that HAGD achieves up to 91% behavioural preservation (±2.3%) on modular arithmetic tasks, while maintaining interpretable subgraph sizes. The team validated the circuits through causal intervention protocols, confirming that these subgraphs represent genuine computational components rather than correlational artifacts. Cross-architecture analyses revealed that extracted circuits share moderate structural similarity (≈67%) across model families such as GPT-2, Llama, and Pythia, suggesting the existence of shared computational patterns and potential universal principles of neural computation.
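The causal-intervention check can be sketched as a simple agreement metric: compare the full model's answer with a run in which everything outside the candidate circuit is ablated, and count how often the two match. The run_model and run_with_circuit_only callables below are hypothetical stand-ins for whatever evaluation harness is used; only the metric itself is shown.

```python
# Hedged sketch of behavioural preservation under causal intervention.
# `run_model` and `run_with_circuit_only` are assumed interfaces, not the paper's API.
def behavioural_preservation(inputs, run_model, run_with_circuit_only, circuit):
    matches = 0
    for x in inputs:
        full_answer = run_model(x)                          # unmodified forward pass
        pruned_answer = run_with_circuit_only(x, circuit)   # complement of the circuit ablated
        matches += int(full_answer == pruned_answer)
    return matches / len(inputs)                            # e.g. ~0.91 on modular arithmetic
```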
HAGD establishes necessity and sufficiency criteria for circuit verification against behavioural benchmarks, providing a robust methodology for interpreting large-scale LLMs. This work lays the foundation for future advances in AI interpretability, highlighting both the potential and current limitations of mechanistic approaches for understanding complex language models.
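As a hedged illustration of how such criteria might be operationalised: a circuit counts as sufficient if it alone preserves behaviour above some threshold, and as necessary if ablating it drops behaviour below another. The thresholds below are illustrative assumptions, not values from the paper.

```python
# Illustrative necessity/sufficiency check; thresholds are assumptions.
def verify_circuit(preservation_with_circuit_only, preservation_without_circuit,
                   sufficiency_threshold=0.9, necessity_threshold=0.5):
    sufficient = preservation_with_circuit_only >= sufficiency_threshold  # circuit alone reproduces the behaviour
    necessary = preservation_without_circuit <= necessity_threshold      # removing the circuit destroys it
    return sufficient and necessary
```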
👉 More information
🗞 Hierarchical Sparse Circuit Extraction from Billion-Parameter Language Models through Scalable Attribution Graph Decomposition
🧠 ArXiv: https://arxiv.org/abs/2601.12879
