Recent multimodal models show strong potential for creating images from text, and many use chain-of-thought reasoning to improve results. These methods, however, typically treat image generation as separate from the planning process or rely on abstract textual descriptions. Dongzhi Jiang, Renrui Zhang, and Haodong Li, alongside colleagues, now present an approach called Draft-as-CoT, or DraCo, which integrates textual and visual information during reasoning. The method first generates a low-resolution draft image as a concrete visual plan, then uses the model’s own understanding capability to identify inconsistencies between the draft and the original text prompt, and finally corrects them while refining the image with increased detail. The technique targets two challenges, imprecise textual planning and the difficulty of generating images with unusual or complex features, and it improves results on the established benchmarks GenEval, Imagine-Bench, and GenEval++.
The core idea is to combine reasoning and generation by first creating a low-resolution draft image, then verifying its consistency with the text prompt, and finally refining the draft into a high-resolution final image. This approach allows for more controlled generation, enabling the model to plan the image before committing to a final output, much like sketching a rough draft before creating a detailed painting. The study includes details on the dataset used for training and the results achieved.
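The three stages described above can be sketched as a small control loop. This is only an illustration of the draft, verify, refine ordering, not the authors' implementation: the `Image` class, the three callables, and the toy stand-ins are all hypothetical, and using 384 for the draft and 1024 for the final output is an assumption based on the two training resolutions mentioned in this article.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Image:
    """Toy stand-in for an image: a side length plus a text note of its content."""
    size: int
    description: str


def draft_as_cot(
    prompt: str,
    generate_draft: Callable[[str, int], Image],
    verify: Callable[[str, Image], List[str]],
    refine: Callable[[str, Image, List[str], int], Image],
    draft_size: int = 384,    # assumed draft resolution
    final_size: int = 1024,   # assumed final resolution
) -> Image:
    # Stage 1: a low-resolution draft acts as the concrete visual plan.
    draft = generate_draft(prompt, draft_size)
    # Stage 2: compare the draft with the prompt and list the mismatches.
    issues = verify(prompt, draft)
    # Stage 3: fix the listed mismatches while producing the larger image.
    return refine(prompt, draft, issues, final_size)


# Hypothetical toy stages so the sketch runs without any model.
def toy_generate_draft(prompt: str, size: int) -> Image:
    # Deliberately drops the last word to simulate a missed detail.
    return Image(size, " ".join(prompt.split()[:-1]))


def toy_verify(prompt: str, draft: Image) -> List[str]:
    drawn = draft.description.split()
    return [word for word in prompt.split() if word not in drawn]


def toy_refine(prompt: str, draft: Image, issues: List[str], size: int) -> Image:
    return Image(size, " ".join([draft.description] + issues))


result = draft_as_cot("a purple giraffe", toy_generate_draft, toy_verify, toy_refine)
print(result)
```

In a real system each callable would be a call into the same unified model; passing them in as parameters simply keeps the sketch independent of any particular model API.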
To support this research, the team created the DraCo-240K dataset, comprising 240,000 image-text pairs focusing on general correction, instance manipulation, and layout reorganization. The model builds upon Bagel, a large language model with image generation capabilities, and was trained for 14,000 steps utilizing exponential moving average weighting. Training included images of both 384×384 and 1024×1024 resolution to enhance generation capabilities at different scales. The results demonstrate that DraCo generates high-quality images with accurate alignment to the text prompts, exhibiting improved image quality and reduced artifacts compared to other methods. The team acknowledges limitations, including the potential unsuitability of the low-resolution draft for all media types and the computational cost of generating the draft, and plans to explore applying DraCo to other media types and developing more efficient draft generation techniques.
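For readers unfamiliar with exponential moving average (EMA) weighting, the following minimal sketch shows the standard update rule, in which a slowly moving copy of the weights is blended with the current weights after each training step. The `decay` value and the plain-list representation are assumptions for illustration; the article does not state the settings used for DraCo.

```python
from typing import List


def ema_update(ema_weights: List[float], weights: List[float], decay: float = 0.999) -> List[float]:
    """Return the new EMA weights: decay * old_ema + (1 - decay) * current."""
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, weights)]


# One update step with an illustrative decay of 0.9.
ema = ema_update([0.0, 2.0], [1.0, 2.0], decay=0.9)
print(ema)
```

A decay close to 1 makes the averaged weights change slowly, which smooths out step-to-step noise in training.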
Draft-as-CoT for Enhanced Image Generation
This study introduces DraCo, a novel interleaved reasoning paradigm designed to enhance text-to-image generation within unified multimodal large language models. Unlike existing methods, DraCo uses both textual and visual content throughout the reasoning process. It begins by generating a low-resolution draft image, which serves as a concrete visual plan and provides structural guidance for subsequent refinement. This draft is not merely a preliminary image but an integral component of the reasoning loop, enabling detailed and nuanced planning. The model then checks the draft against the input prompt for semantic misalignments. This verification step is crucial: it identifies discrepancies and guides selective corrections to the draft.
The team achieves refinement through super-resolution techniques, enhancing the draft image while simultaneously addressing the identified semantic errors. To support training and evaluation, the researchers curated the DraCo-240K dataset, designed to strengthen general correction, instance manipulation, and layout reorganization. Experiments show that DraCo increases scores on GenEval by 8%, on Imagine-Bench by 0.91, and on GenEval++ by 3%, outperforming both direct generation methods and other chain-of-thought approaches. These results demonstrate the value of integrating visual and textual reasoning in text-to-image generation.
Draft Images Guide Improved Generation Accuracy
Recent unified multimodal large language models show impressive capabilities in text-to-image generation, particularly when they employ chain-of-thought reasoning. However, existing methods often treat image generation as a standalone process or rely on abstract textual planning, which limits their effectiveness. The researchers have introduced Draft-as-CoT (DraCo), a novel interleaved reasoning paradigm that uses both textual and visual information for improved planning and verification. DraCo first generates a low-resolution draft image as a concrete visual plan. This draft serves as a structural foundation for subsequent refinement. The team then uses the model’s inherent understanding capabilities to check for semantic misalignments between the draft and the original input prompt.
Any discrepancies are addressed through selective corrections and super-resolution techniques, resulting in a final, high-quality image. To support training and evaluation, the researchers curated DraCo-240K, a dataset designed to strengthen three capabilities: general correction, instance manipulation, and layout reorganization. They also developed DraCo-CFG, a classifier-free guidance strategy tailored for interleaved reasoning. In experiments, DraCo improves GenEval by 8%, Imagine-Bench by 0.91, and GenEval++ by 3%, outperforming both direct generation methods and other chain-of-thought approaches.
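As background, ordinary classifier-free guidance (CFG) combines a model's unconditional and prompt-conditioned predictions and pushes the result further toward the conditioned one. The sketch below shows only that standard rule; it is not DraCo-CFG, whose details this article does not describe, and the function name and list-based inputs are hypothetical.

```python
from typing import List


def cfg_combine(uncond: List[float], cond: List[float], scale: float) -> List[float]:
    """Standard CFG: uncond + scale * (cond - uncond), applied element-wise."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


# A scale of 1.0 returns the conditioned prediction unchanged;
# larger scales move further in the direction the prompt suggests.
guided = cfg_combine([0.0, 1.0], [1.0, 1.0], scale=2.0)
print(guided)
```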
DraCo Improves Image Generation with Interleaved Reasoning
This work introduces DraCo, an interleaved reasoning paradigm for text-to-image generation that uses both textual and visual information for planning and verification. The model first generates a low-resolution draft image as a concrete visual plan, then uses its own understanding capabilities to check the draft against the input prompt for semantic misalignments, and finally resolves them through selective corrections and super-resolution. Training and evaluation are supported by the curated DraCo-240K dataset, which targets general correction, instance manipulation, and layout reorganization. Across GenEval, Imagine-Bench, and GenEval++, DraCo outperforms both direct generation methods and other chain-of-thought approaches.
👉 More information
🗞 DraCo: Draft as CoT for Text-to-Image Preview and Rare Concept Generation
🧠 ArXiv: https://arxiv.org/abs/2512.05112
