Researchers are tackling the critical problem of removing sensitive or copyrighted data from large language models (LLMs) without complete retraining. Efstratios Zaradoukas, Bardh Prenkaj, and Gjergji Kasneci, all from the Technical University of Munich, present a new method called PURGE (Policy Unlearning through Relative Group Erasure), which frames unlearning as a verifiable task. This research is significant because current unlearning techniques frequently fail to fully erase data, compromise model performance, or rely on expensive external resources. PURGE utilises an intrinsic reward signal to penalise forbidden concepts, reducing token usage per forget target by up to a factor of 46 while delivering a 5.48% improvement in fluency and a 12.02% boost in adversarial robustness, alongside 11% unlearning effectiveness on the RWKU benchmark and retention of 98% of the original utility.
This breakthrough tackles a growing problem as LLMs inadvertently memorise data during pretraining, creating compliance issues under regulations like the GDPR and the EU AI Act. The research team achieved this by framing unlearning as a verifiable problem, utilising the Group Relative Policy Optimization framework and an intrinsic reward signal that penalises the mention of forbidden concepts, enabling safe and consistent data removal. PURGE innovates by moving beyond existing unlearning approaches that often leak data, compromise model performance, or rely on expensive external reward models.
The study establishes a novel reinforcement learning approach in which the successful removal of specific data is directly measurable, allowing the model to optimise forgetting in a manner similar to how it optimises reasoning tasks. Experiments demonstrate that PURGE reduces token usage per forget target by up to a factor of 46 compared with state-of-the-art methods, while simultaneously improving fluency by 5.48 percent and adversarial robustness by 12.02 percent over the original model. The core contribution of this work lies in its principled framework, treating unlearning as a verifiable task and leveraging Group Relative Policy Optimization to guide LLMs in forgetting specific knowledge while preserving general utility. Theoretical results prove geometric decay of forbidden-token probabilities and provide high-probability bounds on utility retention via KL divergence, offering formal guarantees of the method’s effectiveness.
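For readers unfamiliar with Group Relative Policy Optimization, the sketch below gives the standard GRPO formulation as published for reasoning tasks; it is background rather than PURGE's exact objective, and the symbols (group size G, clipping range ε, KL weight β) are the usual ones, not values taken from this paper.

```latex
% Standard GRPO background (not PURGE-specific): for a prompt q, sample a group of
% G completions o_1,...,o_G with scalar rewards r_1,...,r_G and normalise within the group.
\[
  \hat{A}_i = \frac{r_i - \mathrm{mean}(r_1,\dots,r_G)}{\mathrm{std}(r_1,\dots,r_G)},
  \qquad
  \rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)}
\]
% The policy maximises a clipped surrogate while a KL penalty keeps it close to a
% reference model, which is what protects general utility during unlearning.
\[
  \mathcal{J}(\theta) = \mathbb{E}\left[
    \frac{1}{G} \sum_{i=1}^{G}
    \min\!\big( \rho_i \hat{A}_i,\; \mathrm{clip}(\rho_i,\, 1-\epsilon,\, 1+\epsilon)\, \hat{A}_i \big)
  \right]
  - \beta\, D_{\mathrm{KL}}\!\left( \pi_\theta \,\Vert\, \pi_{\mathrm{ref}} \right)
\]
```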
This approach is not only more reliable but also more scalable and cost-effective for real-world deployments, as it requires no external reward models. Extensive evaluation on the Real World Knowledge Unlearning (RWKU) benchmark confirms that PURGE achieves 11 percent unlearning effectiveness while maintaining 98 percent of the original model’s utility. This demonstrates a significant advancement in the field, offering a solution that balances the need for data removal with the preservation of overall model performance. The research suggests a promising new direction for unlearning research, combining theoretical guarantees, improved safety, and practical deployment efficiency, paving the way for LLMs that can comply with data privacy regulations and ethical guidelines.
Scientists' Method
Scientists developed PURGE, a novel method for unlearning data from large language models (LLMs) without complete retraining, addressing critical compliance challenges posed by regulations like the GDPR and the EU AI Act. The research team formulated unlearning as a verifiable problem, grounding their approach in the Group Relative Policy Optimization (GRPO) framework. PURGE employs an intrinsic reward signal that actively penalises any model output mentioning forbidden concepts, facilitating safe and consistent data erasure. This technique circumvents the limitations of existing methods, which often leak data, compromise fluency, or rely on computationally expensive external reward models.
Experiments involved training the LLM to minimise the mention of targeted, sensitive information, quantified through a specifically designed reward function. The study pioneered the use of this intrinsic reward, directly measuring the absence of forbidden concepts in generated text, thereby enabling verifiable unlearning. Researchers harnessed the GRPO framework to guide the model’s policy, effectively steering it away from recalling erased data. This process involved iteratively refining the model’s parameters based on the reward signal, encouraging forgetting while preserving general language capabilities.
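As a rough illustration only, the snippet below sketches what such an intrinsic reward and the subsequent group-relative normalisation could look like; the function names, the simple keyword-matching rule, and the ±1 reward values are assumptions chosen for clarity, not the paper's implementation.

```python
import re
import statistics

def intrinsic_reward(completion: str, forbidden_concepts: list[str]) -> float:
    """Hypothetical intrinsic reward: penalise any mention of a forbidden concept.

    Returns 1.0 when no forbidden concept appears in the completion and -1.0
    otherwise. PURGE's actual reward shaping may differ.
    """
    text = completion.lower()
    for concept in forbidden_concepts:
        # Simple whole-word match; a real system would also need aliases,
        # paraphrases, and other surface forms of the forbidden concept.
        if re.search(r"\b" + re.escape(concept.lower()) + r"\b", text):
            return -1.0
    return 1.0

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style normalisation of the rewards within one sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Example: one prompt, a group of four sampled completions.
forbidden = ["Jane Doe", "Project Nightfall"]  # hypothetical forget targets
completions = [
    "The novel was written by Jane Doe in 2019.",
    "I don't have information about that author.",
    "Project Nightfall was a classified initiative.",
    "There is no public record of that work.",
]
rewards = [intrinsic_reward(c, forbidden) for c in completions]
print(rewards)                              # [-1.0, 1.0, -1.0, 1.0]
print(group_relative_advantages(rewards))   # safe completions get positive advantage
```

Completions that avoid the forbidden concepts receive positive group-relative advantages, so the policy update pushes the model towards refusal-style or unrelated answers while the KL term keeps it close to the original model.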
The team evaluated PURGE against state-of-the-art unlearning methods, demonstrating a reduction in token usage per forget target of up to a factor of 46. Furthermore, the study revealed a 5.48 percent improvement in fluency and a 12.02 percent increase in adversarial robustness compared to the base model. Extensive evaluation on the Real World Knowledge Unlearning benchmark showed PURGE achieving 11 percent unlearning effectiveness, successfully removing targeted information while maintaining 98 percent of the model’s original utility. This demonstrates that framing LLM unlearning as a verifiable task enables more reliable, efficient, and scalable forgetting, offering a promising new direction for future research.
PURGE enhances LLM unlearning, fluency and robustness against adversarial attacks
Scientists have developed PURGE, a new method for removing sensitive data from large language models (LLMs) without requiring complete retraining. This research addresses a critical need for compliance with data privacy regulations like the GDPR and the EU AI Act, which demand the ability to erase information from deployed models. The team formulated unlearning as a verifiable problem, utilising the Group Relative Policy Optimization framework to achieve safe and consistent data erasure. Experiments revealed that PURGE reduces token usage per forget target by up to a factor of 46 compared to state-of-the-art methods, significantly improving efficiency.
Results demonstrate a 5.48 percent improvement in fluency and a 12.02 percent increase in adversarial robustness over the base model, indicating that PURGE not only removes data but also maintains, and even enhances, model performance. The core of PURGE lies in its intrinsic reward signal, which penalises any mention of forbidden concepts, guiding the model to effectively forget specific information. Measurements confirm that this approach enables more reliable, efficient, and scalable forgetting, offering a promising new direction for unlearning research. Specifically, the work introduces a principled unlearning framework treating the process as a verifiable task, unlike prior methods that attempt direct data removal.
The study achieved 11 percent unlearning effectiveness on the Real World Knowledge Unlearning (RWKU) benchmark while preserving 98 percent of the original utility, a substantial accomplishment in balancing data removal against model functionality. Scientists formally proved geometric decay of forbidden-token probabilities and established high-probability bounds on utility retention using KL divergence, providing theoretical guarantees for the suppression of targeted knowledge. Tests show that PURGE achieves competitive unlearning performance with significantly fewer tokens, up to 46 times fewer per forget target, and without relying on costly external reward models, making it a practical solution for real-world deployments. Furthermore, extensive experiments across knowledge memorisation, manipulation, adversarial robustness, and real-world utility tasks validate that PURGE delivers more natural and coherent outputs. The data show that the method improves fluency by 5.48 percent over the original model and exhibits 12.02 percent greater resistance to adversarial attacks, ensuring safer and more reliable unlearning. The research also establishes empirical constraints, defining empirical retention and generalisation with tolerances ε_R and ε_G that serve as operational proxies for data retention and model performance.
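The paper's exact definitions of these tolerances are not reproduced here; as an illustrative assumption only, retention and generalisation constraints of this kind are typically stated as bounds on how far the unlearned model's utility may drift from the original model's:

```latex
% Illustrative assumption, not quoted from the paper: utility on retained data and on
% held-out general tasks must stay within small tolerances of the original model.
\[
  \big| U_{\mathrm{retain}}(\pi_{\theta}) - U_{\mathrm{retain}}(\pi_{\mathrm{orig}}) \big| \le \varepsilon_R,
  \qquad
  \big| U_{\mathrm{gen}}(\pi_{\theta}) - U_{\mathrm{gen}}(\pi_{\mathrm{orig}}) \big| \le \varepsilon_G
\]
```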
👉 More information
🗞 Reinforcement Unlearning via Group Relative Policy Optimization
🧠 ArXiv: https://arxiv.org/abs/2601.20568
