Multi-subject Customization Achieves Layout Control, Preserving Identity Without Training

Creating images that convincingly combine multiple, user-defined subjects remains a significant challenge in image synthesis, often hampered by conflicting or missing subjects. Binhe Yu, Zhen Wang, and Kexin Li, from their respective institutions, alongside Yuqian Yuan, Wenqiao Zhang, and Long Chen, now present a new framework called AnyMS that tackles this problem. AnyMS achieves coherent multi-subject image generation without requiring additional training, a major step towards greater scalability and efficiency. The team accomplishes this through a ‘bottom-up’ decoupling method that separates and then harmonizes text prompts, subject images, and layout constraints during the image creation process, resulting in state-of-the-art performance and the ability to handle complex compositions with numerous subjects.

AnyMS, Multi-Subject Image Generation Details

AnyMS is a new image generation technique designed to improve the creation of images containing multiple subjects, ensuring accurate depiction, adherence to text prompts, and correct layout. The paper's appendix provides supporting evidence through visualizations, user-study results, and implementation details. It is divided into five sections: visualizations of attention maps, a user study with 25 participants, additional generated images, the quantitative evaluation setup, and a discussion of broader impacts. The attention-map visualizations show that AnyMS confines each subject's attention to its designated region, preventing feature entanglement between subjects and improving results.
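
As a rough illustration of what such a visualization measures, the sketch below (not from the paper; the normalized box format and the 16x16 map size are assumptions) computes the fraction of a subject's spatial cross-attention mass that falls inside its layout box.

```python
# Illustrative only: measuring how much of one subject's attention lands in its layout box.
import torch

def box_to_mask(box, h, w):
    """Rasterize a normalized (x0, y0, x1, y1) box into a binary h x w mask."""
    x0, y0, x1, y1 = box
    mask = torch.zeros(h, w)
    mask[int(y0 * h):int(y1 * h), int(x0 * w):int(x1 * w)] = 1.0
    return mask

def attention_mass_in_box(attn_map, box):
    """Fraction of a subject's spatial attention that falls inside its box."""
    h, w = attn_map.shape
    return (attn_map * box_to_mask(box, h, w)).sum() / attn_map.sum()

# Toy example: a 16x16 attention map concentrated in the top half of the frame.
attn = torch.zeros(16, 16)
attn[:8, :] = torch.rand(8, 16)
print(attention_mass_in_box(attn, (0.0, 0.0, 1.0, 0.5)).item())  # close to 1.0
print(attention_mass_in_box(attn, (0.0, 0.5, 1.0, 1.0)).item())  # close to 0.0
```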

The user study revealed a strong preference for AnyMS, with participants scoring it highest on concept alignment, text alignment, layout control, and overall quality. Additional generated images show that AnyMS is generalizable and robust, handling a large number of subjects and diverse scenarios. The quantitative evaluation covers 24 subjects across 11 multi-subject combinations, testing the system's ability to handle compositional complexity. The research acknowledges potential positive impacts in areas like advertising and film, while also emphasizing the need for responsible development to prevent harmful content and protect privacy.

Key technical concepts underpinning AnyMS include diffusion models, text-to-image generation, multi-subject image generation, and attention mechanisms. The system addresses the problem of feature entanglement, where models confuse attributes between subjects, and offers layout control, allowing users to specify where each subject appears. This work is relevant to researchers, developers, and anyone following advances in AI image generation.

Dual-Level Decoupling for Layout-Guided Image Customization

The AnyMS framework introduces a training-free approach to layout-guided multi-subject image customization, overcoming limitations of existing methods that often require additional training and struggle to balance text alignment, subject identity, and layout control. AnyMS takes text prompts, subject images, and layout constraints as inputs, employing a bottom-up dual-level decoupling mechanism to harmonize their integration during image generation. This separation of attention processes ensures accurate and coherent results, even with complex compositions and numerous subjects. The researchers designed a global decoupling strategy that isolates the cross-attention between textual and visual conditions, directly improving text alignment.

A local decoupling mechanism then confines each subject's attention to its designated area, preventing conflicts between subjects and guaranteeing identity preservation and precise layout control. The framework relies on pre-trained image adapters to extract subject-specific features already aligned with the diffusion model, eliminating the need for subject learning or adapter tuning. Experiments demonstrate that AnyMS achieves state-of-the-art performance, supporting complex compositions and scaling to a larger number of subjects while maintaining a balance between layout control, text alignment, and identity preservation.
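
To make the two decoupling ideas concrete, here is a minimal, hypothetical sketch written as masked cross-attention in PyTorch. It is not the paper's implementation: the function names, token counts, and simple binary region masks are illustrative assumptions, and the real AnyMS operates inside the attention layers of a pre-trained diffusion model.

```python
# Hypothetical sketch of global + local attention decoupling; shapes and names are placeholders.
import torch
import torch.nn.functional as F

def cross_attention(q, k, v):
    """Plain scaled dot-product cross-attention from image queries to condition tokens."""
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ v

def decoupled_attention(q, text_kv, subject_kvs, subject_masks):
    """Global decoupling: text and each subject get separate cross-attention passes.
    Local decoupling: each subject's output is kept only inside its layout region."""
    out = cross_attention(q, *text_kv)                  # text condition guides the whole frame
    for (k, v), region in zip(subject_kvs, subject_masks):
        subj = cross_attention(q, k, v)                 # subject-specific features
        out = out + region.unsqueeze(-1) * subj         # injected only where the mask is 1
    return out

# Toy shapes: 64 image tokens (an 8x8 latent), feature dimension 32, two subjects.
n, d = 64, 32
q = torch.randn(n, d)
text_kv = (torch.randn(10, d), torch.randn(10, d))
subject_kvs = [(torch.randn(4, d), torch.randn(4, d)) for _ in range(2)]
masks = [torch.zeros(n), torch.zeros(n)]
masks[0][:32] = 1.0   # first subject: top half of the 8x8 latent
masks[1][32:] = 1.0   # second subject: bottom half
print(decoupled_attention(q, text_kv, subject_kvs, masks).shape)  # torch.Size([64, 32])
```

The point the sketch mirrors is that the text condition and the subject images never share a single key/value sequence, and each subject's contribution is zeroed outside its own layout region.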

Layout-Guided Image Generation Without Training

AnyMS is a novel training-free framework for layout-guided multi-subject customization that achieves state-of-the-art performance in image generation. The system addresses the challenges of synthesizing images with multiple user-specified subjects, particularly missing or conflicting depictions, without requiring additional training. AnyMS takes text prompts, subject images, and layout constraints as inputs, employing a bottom-up dual-level decoupling strategy to harmonize their integration during image creation and ensure both text alignment and accurate subject representation.

The team implemented global decoupling, which separates the interplay between textual and visual conditions for precise text alignment, and local decoupling, which confines each subject to its designated area to prevent conflicts and preserve identity. Experiments demonstrate that AnyMS successfully balances text alignment, subject identity preservation, and layout control, a combination previously difficult to achieve without extensive training. Pre-trained image adapters extract subject-specific features already aligned with the diffusion model, eliminating the need for subject learning or adapter tuning while delivering improved performance across diverse benchmarks and supporting intricate scenes with an increasing number of subjects.
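
The adapter idea can be pictured with a small, hedged sketch: a frozen encoder and a frozen projection turn each reference image into a handful of key/value tokens, with no gradient updates anywhere. The module names, dimensions, and the linear "encoder" below are placeholders, not the actual adapter architecture used by AnyMS.

```python
# Hypothetical sketch of a frozen, pre-trained subject adapter; all sizes are placeholders.
import torch
import torch.nn as nn

class FrozenSubjectAdapter(nn.Module):
    """Stand-in for a pre-trained image encoder plus adapter projection."""
    def __init__(self, img_dim=512, attn_dim=32, n_tokens=4):
        super().__init__()
        self.encoder = nn.Linear(3 * 64 * 64, img_dim)            # placeholder image encoder
        self.to_tokens = nn.Linear(img_dim, n_tokens * attn_dim)  # placeholder adapter projection
        self.n_tokens, self.attn_dim = n_tokens, attn_dim
        self.requires_grad_(False)   # frozen: no subject learning, no adapter tuning

    @torch.no_grad()
    def forward(self, image):
        feat = self.encoder(image.flatten())
        return self.to_tokens(feat).view(self.n_tokens, self.attn_dim)

adapter = FrozenSubjectAdapter()
# One set of key/value tokens per reference image, obtained without any training.
subject_tokens = [adapter(torch.rand(3, 64, 64)) for _ in range(3)]
print([t.shape for t in subject_tokens])   # three tensors of shape (4, 32)
```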

AnyMS, Layout and Identity in Generation

AnyMS represents a significant advance in image customization, offering a training-free framework capable of synthesizing images containing multiple user-specified subjects under the guidance of text prompts and layout constraints. Researchers achieved this by decoupling the integration of textual, visual, and layout information in a bottom-up, dual-level manner, effectively balancing text alignment, subject identity preservation, and adherence to the specified layout. This approach allows the creation of complex compositions with a greater number of subjects than previously possible. The system achieves state-of-the-art performance in multi-subject image generation, maintaining high image fidelity and scaling robustly to more complex scenes. While acknowledging the framework's reliance on underlying pre-trained models, the researchers highlight its effectiveness with three or more subjects; future work aims to extend the approach to video customization and to explore jointly controlling subject, action, and style within generated images.

👉 More information
🗞 AnyMS: Bottom-up Attention Decoupling for Layout-guided and Training-free Multi-subject Customization
🧠 ArXiv: https://arxiv.org/abs/2512.23537

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.

Latest Posts by Rohail T.:

Provably Secure Generative AI Achieves Reduced Risk through Reliable Consensus Sampling

January 8, 2026
Superconducting Qubits Achieve Programmable Heisenberg Simulation with Five-Qubit Chains

January 8, 2026
Advances in Quantum Systems Enable Analysis of Higgs Mode Damping in 2D Spin Models

January 8, 2026