KaVa: Latent Reasoning via Compressed KV-Cache Distillation Achieves Efficient Knowledge Transfer from Teacher Models via Self-Distillation

Large language models demonstrate impressive abilities in complex reasoning, but their detailed step-by-step approach demands substantial computing power and memory, often generating unnecessarily lengthy explanations. Anna Kuzina from Qualcomm AI Research, Maciej Pioro from IDEAS NCBR / IPPT PAN, and Paul N. Whatmough from Qualcomm AI Research, along with their colleagues, introduce KaVa, a novel framework that tackles this challenge by efficiently transferring knowledge from a powerful ‘teacher’ model to a more streamlined ‘student’ model. The team distills this knowledge directly from a compressed memory of the teacher’s reasoning process, aligning the internal steps of both models using flexible, continuous representations. The approach not only surpasses existing efficient reasoning methods but also substantially reduces the performance loss incurred when moving from formal mathematical problems to natural-language reasoning. It also scales effectively to larger models, establishing compressed-memory distillation as a powerful technique for combining accuracy and efficiency in artificial intelligence.

Latent Reasoning and Selective Step Removal

Scientists investigated how large language models solve mathematical problems, focusing on the internal reasoning processes within these models. The research explores techniques to improve a model’s ability to demonstrate reasoning that mirrors human thought. The team employed a method that generates a chain of thought, a series of intermediate steps, and then selectively removes certain steps to refine accuracy. Analysis of the model’s internal representations revealed how it focuses on key information during problem-solving. Researchers used cosine similarity to measure the alignment between the model’s internal key and value representations and those associated with the correct solution, both before and after this selective removal of information.
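
To make this diagnostic concrete, here is a minimal sketch of a cosine-similarity probe over key/value representations. The tensor shapes, names, and the synthetic data are illustrative assumptions, not the paper’s actual analysis code:

```python
import torch
import torch.nn.functional as F

def kv_cosine_alignment(student_kv: torch.Tensor, target_kv: torch.Tensor) -> torch.Tensor:
    """Mean cosine similarity between per-step key/value vectors.

    Both tensors are assumed to have shape (steps, hidden_dim); higher
    values mean the representations point in similar directions, which
    is how alignment with the correct solution is probed here.
    """
    return F.cosine_similarity(student_kv, target_kv, dim=-1).mean()

# Hypothetical usage: compare representations before and after pruning steps.
before = torch.randn(16, 256)
after = before + 0.1 * torch.randn(16, 256)
print(kv_cosine_alignment(before, after).item())
```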

This analysis aimed to determine if the removal process helped the model prioritize relevant information and improve its accuracy. By comparing the model’s reasoning to that of human experts and the correct final answer, scientists gained insights into the model’s internal thought processes. This research opens a window into the “black box” of large language models, allowing scientists to understand how these models solve complex problems. By analyzing internal representations and attention weights, the team hopes to improve reasoning abilities and create more reliable artificial intelligence systems.

Knowledge Distillation via Compressed Key-Value Caches

Scientists developed KaVa, a novel framework that distills knowledge from a compressed key-value cache of a teacher model directly into a student model’s latent reasoning process. This work addresses limitations in current latent reasoning methods, which often struggle with a lack of supervision and reduced performance on complex, natural-language reasoning tasks. The team engineered a system where a teacher model generates a complete key-value cache by processing a full chain-of-thought sequence, capturing layer-wise and head-wise key-value representations. Subsequently, a module selectively compresses this cache to match the allocated resources for latent thoughts within the student model.
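
One plausible form of such compression is pooling the teacher’s per-token cache down to a fixed latent budget, as sketched below. The pooling scheme and shapes are illustrative assumptions, not the specific compressor described in the paper:

```python
import torch
import torch.nn.functional as F

def compress_kv_cache(kv: torch.Tensor, num_latent: int) -> torch.Tensor:
    """Compress a per-token KV cache to `num_latent` slots.

    kv: (heads, seq_len, head_dim) keys or values for one layer.
    Adaptive average pooling over the sequence axis merges groups of
    adjacent token entries, discarding exact token correspondence
    while keeping a coarse summary of the reasoning trajectory.
    """
    # Pool over the sequence dimension:
    # (heads, head_dim, seq_len) -> (heads, head_dim, num_latent)
    pooled = F.adaptive_avg_pool1d(kv.transpose(1, 2), num_latent)
    return pooled.transpose(1, 2)  # (heads, num_latent, head_dim)

# Hypothetical usage: a 128-token chain-of-thought cache squeezed to 8 latent slots.
teacher_keys = torch.randn(8, 128, 64)
compressed = compress_kv_cache(teacher_keys, num_latent=8)
print(compressed.shape)  # torch.Size([8, 8, 64])
```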

The core innovation lies in a key-value matching loss, which aligns the student’s per-step latent keys and values to the compressed target cache throughout the entire network. This process creates a strong, stepwise internal supervision signal, effectively teaching the student to “think like” a compact cache of its own explicit reasoning. Researchers implemented this by alternating the model between a teacher mode, building the key-value cache, and a student mode, generating continuous latent thoughts. Experiments demonstrate that knowledge can be successfully distilled from a compressed key-value cache, even though the compression process removes direct token correspondence.
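
A sketch of what such a key-value matching loss could look like, averaging a distance between the student’s latent KV entries and the compressed targets across layers; the choice of smooth L1 distance and the list-of-tuples layout are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def kv_matching_loss(student_cache, target_cache):
    """Align student latent keys/values with compressed teacher targets.

    Both arguments are lists over layers of (keys, values) tensors with
    shape (heads, num_latent, head_dim). Supervising every layer and
    every latent step provides the stepwise internal signal described
    above.
    """
    total = 0.0
    for (k_s, v_s), (k_t, v_t) in zip(student_cache, target_cache):
        total = total + F.smooth_l1_loss(k_s, k_t) + F.smooth_l1_loss(v_s, v_t)
    return total / len(student_cache)
```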

The approach consistently outperforms strong latent reasoning baselines, exhibiting markedly smaller performance degradation when transitioning from equation-only to natural-language reasoning. Furthermore, the system scales effectively to larger models while retaining the efficiency benefits of latent inference. This research establishes compressed key-value cache distillation as a scalable supervision signal for latent reasoning, combining the accuracy of chain-of-thought trained teachers with the deployability of latent inference.

KaVa Improves Latent Reasoning with Key-Value Alignment

The research team introduces KaVa, a novel framework designed to improve latent reasoning in large language models. Their central achievement lies in demonstrating that compressed key-value caches from a teacher model can effectively supervise a student model’s latent reasoning process, even after losing direct correspondence to specific tokens. By aligning the student’s internal trajectory with the teacher’s reasoning dynamics within the key-value space, KaVa overcomes limitations associated with traditional token-level distillation and the computational demands of verbose chain-of-thought reasoning. Results consistently show KaVa outperforms existing latent reasoning baselines, scales effectively to larger models, and maintains robust performance on natural language reasoning tasks where previous methods often struggle. The team establishes compressed key-value cache distillation as a scalable and effective supervision technique, offering a pathway to develop efficient and powerful reasoning models.

Latent Reasoning Distilled From Compressed Key-Value Caches

Scientists have developed KaVa, a novel framework that successfully distills knowledge from a compressed key-value cache directly into a latent reasoning student, achieving a significant breakthrough in efficient artificial intelligence. This work demonstrates, for the first time, that valuable information can be extracted from a compressed key-value cache, even after the removal of direct token correspondence during the compression process. The team achieved this by leveraging the representational flexibility of continuous latent tokens to align stepwise key-value trajectories, effectively teaching the student model to “think like” a compact cache. Experiments reveal that KaVa consistently outperforms strong latent reasoning baselines, demonstrating a marked improvement in performance on natural language reasoning tasks.

Notably, the approach exhibits a substantially smaller performance degradation when transitioning from equation-only reasoning to more complex, natural-language traces, a challenge that has previously hindered other latent reasoning methods. This indicates a greater robustness and adaptability to real-world reasoning scenarios. The research team validated the framework’s scalability by successfully applying it to larger models, maintaining the efficiency benefits of latent inference. The core of the breakthrough lies in a three-component system: a model that alternates between teacher and student modes, a module that compresses the teacher cache, and a loss function that aligns the student’s latent key and value representations to the compressed target. This innovative approach provides a strong, stepwise internal supervision signal, enabling the student model to learn from the compressed key-value cache without relying on explicit token correspondence. Results demonstrate that KaVa effectively bridges the gap between template-like latent traces and natural-language reasoning, yielding strong gains on natural language datasets while preserving computational efficiency.
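
Putting the three components together, a highly simplified training-loop sketch is shown below, reusing the helper functions sketched earlier. The model interface (`run_teacher`, `run_student`), the batch fields, and the loss weighting are hypothetical, offered only to illustrate how the pieces could fit:

```python
import torch

def train_step(model, batch, optimizer, num_latent=8, kv_weight=1.0):
    """One KaVa-style self-distillation step (illustrative only).

    1. Teacher mode: run the full chain-of-thought to build a KV cache.
    2. Compress that cache to the latent-thought budget.
    3. Student mode: generate continuous latent thoughts, collect the
       student's per-step KV cache, and align it to the target.
    """
    # Teacher pass: no gradients; it only supplies supervision targets.
    with torch.no_grad():
        teacher_cache = model.run_teacher(batch["question"], batch["cot"])  # hypothetical API
    target_cache = [
        (compress_kv_cache(k, num_latent), compress_kv_cache(v, num_latent))
        for k, v in teacher_cache
    ]

    # Student pass: latent reasoning plus answer prediction.
    answer_logits, student_cache = model.run_student(batch["question"], num_latent)  # hypothetical API
    answer_loss = torch.nn.functional.cross_entropy(
        answer_logits.flatten(0, 1), batch["answer_ids"].flatten()
    )
    loss = answer_loss + kv_weight * kv_matching_loss(student_cache, target_cache)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```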

👉 More information
🗞 KaVa: Latent Reasoning via Compressed KV-Cache Distillation
🧠 ArXiv: https://arxiv.org/abs/2510.02312

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
