The increasing sophistication of artificial intelligence raises concerns about misuse, particularly in image generation, where harmful content can be created with ease. Researchers Zongsheng Cao, Yangfan He, and Anran Liu, together with Jun Xie, Feng Chen, and Zepeng Wang of Lenovo, address this challenge with PurifyGen, a new method for safe text-to-image generation. Their work moves beyond traditional safety filters that rely on blocked keywords or extensive retraining, instead offering a training-free approach that assesses and modifies prompts before they are processed. By evaluating the semantic meaning of each word and selectively removing harmful associations while preserving the original intent, PurifyGen significantly reduces the generation of unsafe images, demonstrating superior performance across multiple datasets and offering a robust option for responsible AI development.
Diffusion Models: Safety and Control Challenges
Text-to-image (T2I) diffusion models such as DALL-E 3 and Stable Diffusion (along with video counterparts like Sora) can be coaxed into generating unsafe or undesirable content. These models are vulnerable to prompt-engineering techniques that bypass safety filters, potentially producing harmful, biased, or inappropriate images. Current research addresses this problem from several directions, aiming to create more responsible and aligned AI systems. Researchers are exploring prompt engineering and red-teaming, including adversarial attacks that probe vulnerabilities in safety mechanisms, compositional attacks that break a harmful request into innocuous-looking pieces, and systematic red-teaming efforts involving human testers.
Model modification and training techniques, such as unlearning, concept editing, and safety-driven training, are also being investigated to remove unsafe associations or minimize harmful generations. Runtime control methods, such as latent-space manipulation and guardrails, offer further ways to steer generation away from unsafe regions. Effective evaluation of T2I model safety is equally crucial, requiring robust metrics and sustained red-teaming. Each line of defense has limitations: many existing safety mechanisms are easily bypassed, unlearning concepts from large models is complex, and improving safety often trades off against image quality or creative freedom. Compositional attacks in particular prove effective at circumventing filters, highlighting the need for continuous monitoring and adaptation as models evolve. The field amounts to an ongoing contest between those attempting to bypass safety measures and those developing new defenses.
Dual-Stage Prompt Purification for Safe Generation
To enhance safety in text-to-image (T2I) diffusion models, scientists developed PurifyGen, a training-free approach that preserves existing model weights while mitigating unsafe content. This work introduces a dual-stage prompt purification strategy, beginning with a detailed safety evaluation of individual tokens within a prompt. The team calculates a complementary semantic distance, quantifying each token’s proximity to both toxic and clean concept embeddings, enabling fine-grained risk assessment without keyword matching or additional training. Following risk assessment, PurifyGen employs a dual-space transformation to purify prompts by selectively modifying risky token embeddings.
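Before turning to that transformation, the first-stage risk score can be made concrete. Below is a minimal Python sketch of the idea, assuming CLIP-style token embeddings and hand-chosen toxic/clean concept lists; the function name `token_risk_scores` and the subtraction-based scoring formula are illustrative assumptions, not the paper's exact definition.

```python
import numpy as np

def token_risk_scores(token_embs: np.ndarray,
                      toxic_embs: np.ndarray,
                      clean_embs: np.ndarray) -> np.ndarray:
    """Token-level risk via a complementary semantic distance (illustrative).

    token_embs: (T, d) per-token text embeddings (e.g. from a CLIP encoder).
    toxic_embs: (K, d) embeddings of unsafe concept phrases.
    clean_embs: (M, d) embeddings of safe concept phrases.
    Returns a (T,) array; higher values mean riskier tokens.
    """
    def cos_sim(a, b):
        a = a / np.linalg.norm(a, axis=-1, keepdims=True)
        b = b / np.linalg.norm(b, axis=-1, keepdims=True)
        return a @ b.T

    # Distance to the nearest toxic/clean concept = 1 - max cosine similarity.
    d_toxic = 1.0 - cos_sim(token_embs, toxic_embs).max(axis=1)
    d_clean = 1.0 - cos_sim(token_embs, clean_embs).max(axis=1)

    # Complementary score: tokens close to toxic concepts and far from
    # clean ones receive high risk.
    return d_clean - d_toxic

# Tiny demo on random vectors standing in for real encoder outputs.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 64))
toxic, clean = rng.normal(size=(3, 64)), rng.normal(size=(3, 64))
print(token_risk_scores(tokens, toxic, clean))
```

Because the score compares semantics rather than surface strings, paraphrases and obfuscated spellings of unsafe concepts can still be flagged, which is what keyword blocklists miss.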
Researchers project these embeddings into a null space, neutralizing harmful semantics, and simultaneously align them within the range space of clean concepts. This process actively subtracts unsafe associations while reinforcing positive ones, preserving the original prompt’s intent and coherence. Experiments across five datasets demonstrate PurifyGen’s superiority over existing methods, offering a plug-and-play solution with strong generalization capabilities.
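The two spaces can be expressed with orthogonal projectors. Here is a minimal NumPy sketch, assuming the toxic and clean concept embeddings span the relevant subspaces and that a single blending weight `lam` controls the clean-space alignment; both are assumptions, not the paper's exact formulation.

```python
import numpy as np

def purify_embedding(e: np.ndarray,
                     toxic_basis: np.ndarray,
                     clean_basis: np.ndarray,
                     lam: float = 0.5) -> np.ndarray:
    """Dual-space purification of one risky token embedding (illustrative).

    e:           (d,) token embedding flagged as risky.
    toxic_basis: (K, d) embeddings spanning unsafe concepts.
    clean_basis: (M, d) embeddings spanning safe concepts.
    lam:         strength of the clean-space alignment (assumed hyperparameter).
    """
    def projector(basis):
        # Orthogonal projector onto the span of the concept vectors,
        # built from the pseudoinverse: P = A A^+ with A = basis.T.
        return basis.T @ np.linalg.pinv(basis.T)

    P_toxic = projector(toxic_basis)
    P_clean = projector(clean_basis)

    # Stage 1: null-space projection — subtract the component of e lying
    # in the toxic span, neutralizing harmful semantics.
    e_null = e - P_toxic @ e

    # Stage 2: range-space alignment — reinforce the component lying in
    # the clean span, preserving intent with safe semantics.
    return e_null + lam * (P_clean @ e_null)

# Demo on random stand-in vectors.
rng = np.random.default_rng(1)
e_safe = purify_embedding(rng.normal(size=64),
                          rng.normal(size=(3, 64)),
                          rng.normal(size=(3, 64)))
print(e_safe.shape)  # (64,)
```

The subtraction step removes unsafe directions outright, while the alignment step keeps the edited token semantically anchored to safe concepts rather than leaving a meaningless residual vector.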
PurifyGen: Prompt Purification for Safer Image Generation
PurifyGen is a novel, training-free approach to enhancing safety in text-to-image (T2I) generation while preserving the capabilities of existing models. The work introduces a dual-stage strategy focused on prompt purification, addressing concerns about unsafe content from diffusion models. The method assesses the safety of individual tokens within a user's prompt, moving beyond simple keyword blocking or extensive retraining. PurifyGen calculates a complementary semantic distance for each token, measuring its proximity to toxic and clean concepts, which enables a fine-grained evaluation of risk. Risky tokens then undergo a dual-space transformation: their harmful semantic components are projected into a null space, and the remaining embedding is aligned with safe concepts to reinforce positive meaning. This selective, token-level replacement minimizes disruption to the prompt's coherence and intent. Extensive experiments demonstrate state-of-the-art performance across multiple benchmarks, outperforming other training-free methods and achieving results competitive with fine-tuned models.
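Putting the two stages together, the selective token-level flow might look like the following sketch, which reuses `token_risk_scores()` and `purify_embedding()` from the snippets above; the threshold and parameter names are assumptions, not the paper's calibrated values.

```python
import numpy as np

def purify_prompt_embeddings(token_embs: np.ndarray,
                             toxic_embs: np.ndarray,
                             clean_embs: np.ndarray,
                             risk_threshold: float = 0.0,
                             lam: float = 0.5) -> np.ndarray:
    """End-to-end sketch: score every token, then purify only the risky ones."""
    risk = token_risk_scores(token_embs, toxic_embs, clean_embs)
    purified = token_embs.copy()
    for i in np.flatnonzero(risk > risk_threshold):
        # Selective replacement: safe tokens pass through untouched,
        # preserving the prompt's coherence and original intent.
        purified[i] = purify_embedding(token_embs[i], toxic_embs,
                                       clean_embs, lam)
    return purified
```

Because only flagged tokens are rewritten and the diffusion model itself is never touched, the method stays training-free and plug-and-play: the purified embeddings simply replace the original text-encoder output at generation time.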
Prompt Purification for Safer Image Generation
PurifyGen represents a significant advance in safe text-to-image generation, achieving strong performance without modifying the underlying diffusion model or requiring additional training data. The researchers developed a two-stage prompt purification strategy, evaluating token safety through semantic distance and employing dual-space transformations to remove harmful semantic components while preserving the original intent. This innovative approach selectively refines token embeddings, minimizing disruption to safe content and ensuring high-quality image generation. The method consistently outperforms existing techniques in reducing unsafe content across multiple datasets, maintaining competitive generation speeds and offering broad compatibility with various diffusion model architectures. PurifyGen’s plug-and-play functionality provides a flexible and scalable option for real-world applications, addressing a critical need for responsible AI development. Future research will investigate extending the purification framework to multi-modal generation and interactive prompt refinement.
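As a usage illustration of that plug-and-play compatibility (not the authors' released code), purified embeddings could be fed to an off-the-shelf diffusion pipeline through its standard prompt-embedding interface. The model checkpoint, the prompt, and the torch port of the NumPy sketches above are all assumptions.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Encode the prompt with the pipeline's own frozen text encoder.
ids = pipe.tokenizer(
    "a scenic mountain village at dawn",
    padding="max_length",
    max_length=pipe.tokenizer.model_max_length,
    return_tensors="pt",
).input_ids.to("cuda")
with torch.no_grad():
    token_embs = pipe.text_encoder(ids)[0]  # per-token embeddings

# Purification would slot in here; purify_prompt_embeddings() is the NumPy
# sketch above and would need a torch port in practice (assumption).
# token_embs = purify_prompt_embeddings(token_embs, toxic_embs, clean_embs)

# The unmodified diffusion model consumes the (purified) embeddings directly.
image = pipe(prompt_embeds=token_embs).images[0]
image.save("village.png")
```

Nothing in this wiring retrains or edits the model, which is what makes the approach portable across diffusion architectures that expose their text-encoder output.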
👉 More information
🗞 PurifyGen: A Risk-Discrimination and Semantic-Purification Model for Safe Text-to-Image Generation
🧠 ArXiv: https://arxiv.org/abs/2512.23546
