AI Now Generates Sketches Stroke by Stroke

Sketching, a fundamentally sequential activity involving purposeful stroke order and refinement, has largely been treated by current methods as a static image generation problem. Hui Ren and Alexander Schwing from UIUC, working with Yuval Alaluf and Omer Bar Tal from Runway, and Antonio Torralba and Yael Vinker from MIT, address this limitation with VideoSketcher, a data-efficient approach for generating sequential sketches. Their research adapts pretrained text-to-video diffusion models, pairing the semantic planning and stroke-ordering strengths of large language models with video diffusion's ability to render temporally coherent visuals. By representing sketches as progressive videos and employing a novel two-stage fine-tuning strategy, the team demonstrates high-quality sequential sketch generation from minimal human-authored data, as few as seven sketching processes, and opens avenues for interactive and controllable drawing experiences.

Scientists have developed a technique for generating sketches that mimics the step-by-step process humans use to create drawings. This offers a new avenue for intuitive human-computer interaction and could revolutionise design workflows. By combining language models with video generation technology, the system produces surprisingly detailed and ordered sketches from simple text prompts.

A two-stage fine-tuning strategy was central to the approach, first establishing a foundational understanding of shape composition and stroke order using synthetic data, then refining visual appearance with a minimal dataset of human-authored sketches. The most striking aspect of this work is the limited amount of human data required; the model successfully learned to generate high-quality sketches from as few as seven manually authored sketching processes, demonstrating remarkable data efficiency.

This achievement bypasses the need for extensive datasets, a common limitation in machine learning applications. Researchers can now create systems that mimic the creative process with far less reliance on large-scale data collection. The system represents a sketch as a video, where strokes appear one after another on a blank canvas, guided by instructions about the desired drawing order.

For years, computational models have strived to replicate the human drawing process, aiming for richer interaction in areas like visual brainstorming and collaborative design. Previous methods often struggled with both visual quality and maintaining a logical sequence of strokes. SketchRNN, for example, required millions of examples and was limited to specific object categories.

More recently, SketchAgent leveraged large language models but produced simplistic sketches lacking visual detail. This new approach addresses these shortcomings by using video diffusion models, trained on extensive video data, to provide strong visual priors. The team recognised that language models excel at planning what to draw and in what order, but struggle with the visual execution.

Therefore, they decoupled the learning of stroke ordering from the learning of visual appearance. Synthetic shape compositions, designed with controlled temporal structure, were used to teach the model fundamental drawing principles. Once this foundation was established, the model was then refined using only seven examples of human-drawn sketches, capturing both the overall drawing order and the continuous formation of individual strokes.

The result is a system capable of generating detailed, temporally coherent sketches that closely follow text-based instructions. Extensions such as brush style conditioning and autoregressive sketch generation further enhance controllability and enable interactive co-drawing experiences. By allowing users to specify brush styles, the model moves beyond simple stroke generation, offering a level of artistic control typically found in parametric stroke representations.

The research demonstrates the potential for collaborative drawing scenarios, where a human and the model can work together in real time. These capabilities suggest a future where machines can not only generate images but also participate in the creative process itself.

Temporal dynamics underpin realistic sketch synthesis through video diffusion modelling

Pretrained text-to-video diffusion models formed the basis of this work, adapted here for sequential sketch generation. These models, trained on extensive video datasets, inherently possess knowledge of visual appearance, motion, and temporal consistency, qualities essential for realistic sketch creation. The research team represented each sketch as a short video, progressively building the image with each stroke on a blank canvas.
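
To make the representation concrete, here is a toy Python sketch of how an ordered list of strokes could be rasterised into such a progressive video, with each frame containing everything drawn so far. The stroke format, canvas size, and the draw_stroke helper are illustrative assumptions, not the authors' implementation.

```python
# Illustrative only: turning an ordered list of strokes into a "progressive
# video" in which frame k shows every stroke drawn up to and including stroke k.
import numpy as np

def draw_stroke(canvas, stroke):
    """Rasterise one stroke, given as (x, y) points in [0, 1], onto the canvas."""
    h, w = canvas.shape
    for (x0, y0), (x1, y1) in zip(stroke[:-1], stroke[1:]):
        for t in np.linspace(0.0, 1.0, num=64):  # dense sampling along the segment
            x, y = x0 + t * (x1 - x0), y0 + t * (y1 - y0)
            canvas[int(y * (h - 1)), int(x * (w - 1))] = 1.0
    return canvas

def strokes_to_progressive_video(strokes, size=64):
    """Return a (num_strokes, size, size) array of cumulative canvases."""
    canvas = np.zeros((size, size), dtype=np.float32)
    frames = []
    for stroke in strokes:
        canvas = draw_stroke(canvas, stroke)
        frames.append(canvas.copy())
    return np.stack(frames)

# Two toy strokes: a horizontal line, then a vertical line crossing it.
video = strokes_to_progressive_video([
    [(0.1, 0.5), (0.9, 0.5)],
    [(0.5, 0.1), (0.5, 0.9)],
])
print(video.shape)  # (2, 64, 64): the sketch builds up frame by frame
```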

This representation allowed the system to capture the temporal dynamics inherent in the drawing process, something often overlooked in previous generative approaches. Subsequently, a two-stage fine-tuning strategy was implemented to separate the learning of stroke ordering from the learning of visual appearance. First, stroke ordering was learned from synthetically generated shape compositions, which provided controlled temporal structure for training.

These synthetic data allowed the model to learn the logic of drawing sequences without being limited by the diversity of real-world sketches. Learning visual appearance required different data, so the team distilled this information from a limited set of manually authored sketching processes. These seven examples, each capturing both the overall drawing order and the continuous formation of individual strokes, provided the model with crucial visual cues.
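
As a rough picture of that two-stage recipe, the hypothetical Python sketch below substitutes a tiny stand-in network and random tensors for the real pretrained video diffusion backbone and the actual datasets; only the schedule, a longer first stage on plentiful synthetic shape compositions followed by a short second stage on the seven human-authored processes, reflects the approach described above.

```python
# Hypothetical sketch of the two-stage recipe, with a tiny stand-in network
# and random tensors in place of the real video diffusion backbone and data.
import torch
from torch import nn

class TinyVideoDenoiser(nn.Module):
    """Stand-in for a video model: maps a noisy stack of frames to a noise estimate."""
    def __init__(self, frames=8, size=32):
        super().__init__()
        self.shape = (frames, size, size)
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(frames * size * size, 256),
            nn.ReLU(),
            nn.Linear(256, frames * size * size),
        )

    def forward(self, noisy_video):
        return self.net(noisy_video).view(-1, *self.shape)

def denoising_loss(model, clean_video):
    """Very simplified diffusion-style objective: predict the noise that was added."""
    noise = torch.randn_like(clean_video)
    return nn.functional.mse_loss(model(clean_video + noise), noise)

def finetune(model, videos, steps, lr):
    """One fine-tuning stage over a list of (frames, H, W) sketch videos."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for step in range(steps):
        clean = videos[step % len(videos)].unsqueeze(0)  # add a batch dimension
        loss = denoising_loss(model, clean)
        opt.zero_grad()
        loss.backward()
        opt.step()

model = TinyVideoDenoiser()
# Stage 1: plentiful synthetic shape-composition videos teach stroke ordering.
synthetic_videos = [torch.rand(8, 32, 32) for _ in range(100)]
finetune(model, synthetic_videos, steps=200, lr=1e-4)
# Stage 2: a handful of human-authored sketching processes refine appearance.
human_videos = [torch.rand(8, 32, 32) for _ in range(7)]
finetune(model, human_videos, steps=50, lr=1e-5)
```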

By decoupling these two learning stages, the research circumvented the need for vast amounts of human-drawn sketch data, a common limitation in this field. At this stage, the model began to generate sketches that not only followed the specified stroke order but also exhibited rich visual detail. Large language models (LLMs) played a key role in semantic planning, determining what to draw and the sequence of strokes.

Recognising that LLMs are not ideal visual renderers, the team paired them with the video diffusion models, which excel at producing high-quality, temporally coherent visuals. For instance, the LLM might decide to first draw the outline of a face, then the eyes, and finally the mouth, while the video diffusion model would handle the actual rendering of each stroke with appropriate visual fidelity.
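
The division of labour can be pictured with the toy Python sketch below, in which a planner proposes an ordered list of parts and that plan is folded into a prompt for the renderer. Both callables here are hypothetical stand-ins, not a real LLM or video diffusion interface.

```python
# Toy illustration of the split: a planner decides what to draw and in what
# order; a renderer turns the ordered plan into a stroke-by-stroke video.
from typing import Callable, List

def plan_stroke_order(llm: Callable[[str], str], concept: str) -> List[str]:
    """Ask the planner for an ordered list of parts, one per line of its reply."""
    reply = llm(f"List, in drawing order, the parts needed to sketch a {concept}.")
    return [line.strip() for line in reply.splitlines() if line.strip()]

def render_sequential_sketch(video_model: Callable[[str], str], parts: List[str]) -> str:
    """Fold the ordered plan into a prompt and hand it to the video renderer."""
    prompt = ("A sketch drawn stroke by stroke on a blank canvas, in this order: "
              + ", then ".join(parts))
    return video_model(prompt)

# Stand-ins so the pipeline runs end to end without any external service.
toy_llm = lambda _prompt: "outline of the face\neyes\nmouth"
toy_video_model = lambda prompt: f"<video conditioned on: {prompt}>"

parts = plan_stroke_order(toy_llm, "face")
print(render_sequential_sketch(toy_video_model, parts))
```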

High-fidelity sketch synthesis from minimal data via language-conditioned video diffusion

Researchers developed a novel method for sequential sketch generation, achieving high-quality results from a remarkably small dataset of human-authored sketches. The core achievement lies in the model’s ability to learn from as few as seven manually created sketching processes, a figure that underscores exceptional data efficiency.

The limited training data does not compromise visual detail. The generated sketches closely adhere to text-specified stroke orderings, demonstrating a level of control previously unseen with such constrained resources. Furthermore, the system supports brush style conditioning, allowing users to influence stroke appearance with simple visual cues. This capability extends pixel-based generation to include brush-level control, a feature typically found in parametric stroke representations.
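
One way such conditioning might be wired in, purely as an assumed illustration rather than the paper's actual mechanism, is to attach a brush exemplar image as an extra channel alongside every frame of the sketch video, as in the short Python sketch below.

```python
# Assumed illustration only: expose a brush style by stacking a brush exemplar
# image as an extra conditioning channel next to each frame.
import numpy as np

def add_brush_conditioning(video, brush_swatch):
    """video: (T, H, W); brush_swatch: (H, W). Returns (T, 2, H, W)."""
    swatch = np.broadcast_to(brush_swatch, (video.shape[0],) + brush_swatch.shape)
    return np.stack([video, swatch], axis=1)

frames = np.zeros((8, 64, 64), dtype=np.float32)      # sketch-in-progress frames
charcoal = np.random.rand(64, 64).astype(np.float32)  # stand-in brush texture
conditioned = add_brush_conditioning(frames, charcoal)
print(conditioned.shape)  # (8, 2, 64, 64)
```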

At a technical level, the two-stage fine-tuning strategy effectively separates learning stroke ordering from learning sketch appearance, contributing to the model’s overall success. Considering the implications of this data efficiency, the research opens possibilities for personalized sketching tools. Since only seven examples are needed to adapt the model, users could conceivably train it on their own sketching style.

Beyond personalization, the approach facilitates interactive scenarios such as collaborative co-drawing, where multiple users contribute to a single sketch in real time. The model’s ability to generate temporally coherent visuals is particularly impressive, given the challenges of sequential generation. The system’s performance suggests pretrained video diffusion models offer a powerful and flexible foundation for modelling drawing processes, moving beyond reliance on large-scale sketch datasets.

Inside this framework, the research also demonstrates the versatility of video diffusion models. For instance, the model can be extended to support autoregressive sketch generation, enabling additional controllability and interactive experiences. By leveraging the strengths of both language and video models, this work presents a new perspective on sequential sketch generation, one that prioritizes data efficiency and creative control.
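
The autoregressive extension can be imagined as generating the video in chunks, each conditioned on the frames produced so far, which leaves room for a person to step in between chunks. The Python sketch below illustrates that loop with a placeholder generator in place of the real model.

```python
# Hypothetical loop for autoregressive continuation: generate the sketch video
# in chunks, each conditioned on the frames produced so far, so a person can
# inspect or edit between chunks. `generate_chunk` is a placeholder generator.
import numpy as np

def generate_chunk(context_frames, chunk_len=4, size=32):
    """Return `chunk_len` new frames that extend whatever has been drawn so far."""
    last = context_frames[-1] if context_frames else np.zeros((size, size))
    # Pretend each new frame adds a little more ink on top of the previous one.
    return [np.clip(last + 0.1 * (i + 1) * np.random.rand(size, size), 0.0, 1.0)
            for i in range(chunk_len)]

def autoregressive_sketch(total_frames=12, chunk_len=4):
    frames = []
    while len(frames) < total_frames:
        frames.extend(generate_chunk(frames, chunk_len))
        # A co-drawing interface could pause here for human strokes or edits.
    return np.stack(frames)

print(autoregressive_sketch().shape)  # (12, 32, 32)
```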

Learning to sketch realistically with minimal data through process-based generative modelling

Scientists have long sought to replicate the fluidity of human creativity, yet artificial systems often struggle with the subtle dance between planning and execution that defines drawing. This new work sidesteps the need for vast datasets by cleverly combining two existing technologies, large language models and video diffusion models, to generate sketches that mimic the sequential nature of the human artistic process.

Rather than treating a sketch as a single image, the system understands it as a series of strokes unfolding over time, a distinction that unlocks a surprising level of control and realism. For years, generative models have excelled at producing static images, but capturing the process of creation proved far more difficult. This research demonstrates that a system can learn to sketch convincingly from a remarkably small number of examples, just seven authored sketching processes.

That limited data requirement is a departure from the typical hunger of machine learning algorithms for massive training sets, and it suggests a path toward more accessible and adaptable creative tools. Beyond simply recreating existing styles, the system can also respond to textual prompts, altering brush types and even continuing a sketch autonomously.

The true potential extends beyond mere imitation. Consider the implications for design, where rapid prototyping relies on quick visualisation of ideas. Or imagine assistive technologies that translate verbal descriptions into visual representations for individuals with communication difficulties. Further refinement is needed to address subtleties in line quality and stylistic variation.

At present, the system excels at following instructions, but true artistic expression demands a degree of unpredictability and originality that remains a challenge. Once this technology matures, we might see a shift in how digital art is created, moving away from pixel-by-pixel manipulation towards a more directed, process-oriented approach. Beyond art, the principles at play, combining semantic understanding with generative power, could find applications in robotics, where a robot might learn to assemble objects by observing a few demonstrations. Unlike many current AI systems, this work prioritises data efficiency, hinting at a future where complex skills can be learned with minimal supervision.

👉 More information
🗞 VideoSketcher: Video Models Prior Enable Versatile Sequential Sketch Generation
🧠 ArXiv: https://arxiv.org/abs/2602.15819

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
