Large language models demonstrate impressive capabilities, but their size and complexity often limit widespread deployment, creating a need for efficient distillation techniques. Tianzhu Ye, Li Dong, and Zewen Chi from Microsoft Research, alongside Xun Wu, Shaohan Huang, and Furu Wei, address this challenge with an approach called Generative Adversarial Distillation (GAD). Their method trains a smaller ‘student’ language model to mimic the responses of a powerful ‘teacher’ model, even without access to the teacher’s internal workings. The learning process is framed as a competitive game between the student and a ‘discriminator’ that provides adaptive feedback, and the technique consistently outperforms existing distillation methods. Notably, a student trained with GAD achieves performance comparable to its much larger teacher, GPT-5-Chat, marking a significant advance in efficient language model development.
This work addresses the challenge of compressing powerful proprietary models, like GPT-5-Chat, into more manageable sizes, such as models from the Qwen2.5 and Llama families. Because the teacher is a black box, GAD aligns the student model’s outputs with the teacher’s responses rather than with its internal parameters, enabling effective knowledge transfer without access to weights or logits. The team compared GAD to sequence-level knowledge distillation (SeqKD), a common method for compressing LLMs.
GAD frames the student LLM as a generator and trains a discriminator to distinguish its responses from those of the teacher, establishing a competitive learning process: the student is optimized to produce responses the discriminator cannot tell apart from the teacher’s, which provides implicit feedback on generation quality. This makes it possible to transfer capabilities from proprietary teacher models, like GPT-5-Chat, to smaller, open-source student models without access to internal parameters. Results demonstrate that GAD consistently outperforms SeqKD and pre-distilled models across various datasets and model sizes, while maintaining more natural response lengths; SeqKD often produces shorter, less informative responses, suggesting that GAD preserves more of the teacher model’s nuanced language generation capabilities.
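The competitive game between student and discriminator follows the familiar min–max structure of adversarial training. As a rough sketch (the notation below is illustrative, not the paper’s exact objective), with the student policy $\pi_{\theta}$ acting as generator and $D_{\phi}$ as discriminator over prompt–response pairs:

```latex
\min_{\theta} \max_{\phi} \;
\mathbb{E}_{(x,\,y)\sim \text{teacher}}\!\left[\log D_{\phi}(x, y)\right]
+ \mathbb{E}_{\hat{y}\,\sim\, \pi_{\theta}(\cdot \mid x)}\!\left[\log\!\left(1 - D_{\phi}(x, \hat{y})\right)\right]
```

The discriminator $D_{\phi}$ is trained to assign high scores to teacher responses $y$ and low scores to student samples $\hat{y}$, while the student $\pi_{\theta}$ is trained to make its samples indistinguishable, so the discriminator’s score serves as the implicit, adaptive reward described above.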
This adversarial process allows the student to learn without explicit token-level supervision, a significant advance over traditional methods: the discriminator functions as an adaptive reward model that dynamically adjusts to the student’s behavior throughout training, enabling effective learning in scenarios where conventional methods struggle. Experiments employed GPT-5-Chat as the teacher and models from the Qwen2.5 and Llama3 families as students, consistently demonstrating GAD’s superiority over baseline instruction-tuned models and standard knowledge distillation. Notably, a Qwen2.5 model trained with GAD achieves performance comparable to its teacher, GPT-5-Chat, as measured on a standard LLM evaluation benchmark.
These results establish GAD as a robust and effective solution for black-box, on-policy distillation of LLMs, offering a promising pathway for advancing the field and democratizing access to powerful language technologies.
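To make the generator–discriminator dynamic concrete, here is a minimal toy sketch of black-box adversarial distillation in NumPy. All names and the setup are hypothetical simplifications: the “teacher” is a fixed token distribution we can only sample from (mimicking API-only access), the student is a categorical policy updated with a REINFORCE-style gradient, and the discriminator is a logistic classifier over bag-of-token features whose score serves as the student’s reward. This is not the paper’s implementation, only an illustration of the training loop structure.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, LEN, BATCH = 8, 5, 32

# Black-box "teacher": we can sample responses but cannot see its parameters.
teacher_p = rng.dirichlet(np.ones(VOCAB))
student_logits = np.zeros(VOCAB)   # student policy (the generator)
disc_w = np.zeros(VOCAB)           # discriminator weights over token frequencies

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sample(p, n):
    # Draw n toy "responses" of LEN tokens each from distribution p.
    return rng.choice(VOCAB, size=(n, LEN), p=p)

def feats(batch):
    # Bag-of-token frequency features the discriminator scores.
    return np.stack([np.bincount(s, minlength=VOCAB) for s in batch]) / LEN

for step in range(300):
    t_batch = sample(teacher_p, BATCH)                # teacher responses
    s_batch = sample(softmax(student_logits), BATCH)  # on-policy student responses

    # Discriminator step: logistic regression, teacher = 1, student = 0.
    x = np.vstack([feats(t_batch), feats(s_batch)])
    y = np.concatenate([np.ones(BATCH), np.zeros(BATCH)])
    p = 1.0 / (1.0 + np.exp(-x @ disc_w))
    disc_w += 0.5 * x.T @ (y - p) / len(y)

    # Generator step: REINFORCE, using the discriminator score as the
    # adaptive reward (higher = more teacher-like).
    reward = 1.0 / (1.0 + np.exp(-feats(s_batch) @ disc_w))
    adv = reward - reward.mean()
    probs = softmax(student_logits)
    for s, a in zip(s_batch, adv):
        # grad of log-prob of the sampled sequence under the categorical policy
        grad = np.bincount(s, minlength=VOCAB) - LEN * probs
        student_logits += 0.05 * a * grad

# L1 gap between the student's distribution and the unseen teacher's.
print(np.abs(softmax(student_logits) - teacher_p).sum())
```

The key point the sketch illustrates is that the student never receives the teacher’s probabilities or gradients, only samples; the discriminator converts “does this look like the teacher?” into a scalar reward that co-evolves with the student, which is what distinguishes this setup from imitating a fixed dataset of teacher outputs as in SeqKD.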
👉 More information
🗞 Black-Box On-Policy Distillation of Large Language Models
🧠 ArXiv: https://arxiv.org/abs/2511.10643
