Pixel MeanFlow Achieves One-Step Latent-Free Image Generation

Scientists are tackling the limitations of current image generation techniques, which often rely on complex, multi-step processes and hidden ‘latent’ spaces. Yiyang Lu, Susie Lu, and Qiao Sun from MIT, together with Hanhong Zhao, Zhicheng Jiang, Xianbang Wang, and colleagues, present a novel approach called “pixel MeanFlow” (pMF) that generates images in a single step, directly in pixel space. This research is significant because it bypasses the need for latent variables, simplifying the generation process and achieving state-of-the-art results on the challenging ImageNet dataset at both 256×256 and 512×512 resolutions, with FID scores of 2.22 and 2.48 respectively. Their work represents a crucial advancement towards more efficient and effective flow-based generative models.

This breakthrough addresses key limitations in modern diffusion and flow-based models, which typically rely on multi-step sampling and operate within a latent space. The research team formulated a unique strategy by decoupling the network output space from the loss space: the network is designed to predict values on a presumed low-dimensional image manifold, termed x-prediction, while the loss is defined through MeanFlow in velocity space. This approach introduces a transformation linking the image manifold to the average velocity field, enabling more efficient and accurate image generation.

The core of pMF lies in its ability to approximate the average velocity field induced by the underlying Ordinary Differential Equation (ODE) trajectory. By defining a field, x(zt, r, t), representing denoised images, the researchers hypothesise that this field resides on a low-dimensional data manifold, making it more amenable to neural network approximation. Experiments confirm that this formulation aligns well with the manifold hypothesis, resulting in a more learnable target for the network. This allows pMF to directly map noisy inputs to image pixels, offering a “what-you-see-is-what-you-get” property absent in traditional multi-step or latent-based methods.
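As a concrete illustration, if the denoised-image field is taken to be the point reached by stepping from z_t back to time r along the average velocity (an assumed parameterisation for this sketch; the paper's exact transformation may differ), the two quantities are interconvertible. A minimal numpy sketch:

```python
import numpy as np

def u_from_x(z_t, x, r, t):
    """Convert an x-prediction (the denoised image reached at time r)
    into the average velocity over [r, t], assuming the straight-step
    rule z_r = z_t - (t - r) * u.  This parameterisation is an
    assumption, not a quote from the paper."""
    return (z_t - x) / (t - r)

def x_from_u(z_t, u, r, t):
    """Inverse conversion: step from z_t down to the image endpoint."""
    return z_t - (t - r) * u

# Round-trip check on random data (t = 1 is pure noise, r = 0 is data).
rng = np.random.default_rng(0)
z_t = rng.normal(size=(4, 4))
x = rng.normal(size=(4, 4))
u = u_from_x(z_t, x, r=0.0, t=1.0)
assert np.allclose(x_from_u(z_t, u, r=0.0, t=1.0), x)
```

Under this convention the network can output images while losses are stated on velocities, which is the decoupling the paper describes.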

The study reveals strong performance on the ImageNet dataset, achieving a Fréchet Inception Distance (FID) of 2.22 at 256×256 resolution and 2.48 at 512×512 resolution. These results fill a critical gap in the field, demonstrating the feasibility of one-step latent-free generation at high resolutions. Furthermore, the researchers highlight the importance of a proper prediction target, showing that directly predicting a velocity field in pixel space leads to catastrophic performance. This underscores the effectiveness of the x-prediction strategy in guiding the network towards learning meaningful representations.
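For context, FID measures the Fréchet distance between Gaussian fits to Inception features of real and generated images: ‖μ₁−μ₂‖² + Tr(Σ₁+Σ₂−2(Σ₁Σ₂)^½). The sketch below evaluates only the special case of diagonal covariances, where the trace term factorises per dimension; real FID pipelines fit full covariance matrices of Inception-v3 activations.

```python
import numpy as np

def fid_diagonal(mu1, sigma1, mu2, sigma2):
    """Frechet distance between two Gaussians with diagonal covariance:
    ||mu1 - mu2||^2 + sum(s1^2 + s2^2 - 2*s1*s2).
    Simplified for illustration; actual FID uses full covariances."""
    mu1, sigma1, mu2, sigma2 = map(np.asarray, (mu1, sigma1, mu2, sigma2))
    mean_term = np.sum((mu1 - mu2) ** 2)
    cov_term = np.sum(sigma1 ** 2 + sigma2 ** 2 - 2 * sigma1 * sigma2)
    return float(mean_term + cov_term)

# Identical feature distributions give a distance of zero.
print(fid_diagonal([0, 0], [1, 1], [0, 0], [1, 1]))  # 0.0
```

Lower is better, which is why the reported 2.22 and 2.48 indicate generated images statistically close to real ones.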

This work builds upon recent progress in both few-/one-step sampling, exemplified by Consistency Models and MeanFlow, and raw pixel space image generation, such as “Just image Transformers”. By merging these advancements, the team overcame the challenge of designing a unified network capable of simultaneously performing manifold learning and modelling complex trajectories. The resulting pMF not only advances the state-of-the-art in image generation but also paves the way for more efficient and direct generative modelling, potentially simplifying the architecture and training process of future models. The research establishes a solid step towards a single, end-to-end neural network for high-quality image synthesis.

Pixel MeanFlow for direct image generation

Scientists introduced pixel MeanFlow (pMF), a novel approach to one-step latent-free image generation, addressing limitations in existing flow-based models. The research team engineered a system that decouples the network output space from the loss space, targeting a presumed low-dimensional image manifold for network outputs while defining loss via MeanFlow in velocity space. This innovative formulation involved developing a transformation to connect the image manifold with the average velocity field, empirically demonstrating improved alignment with the manifold hypothesis and enhanced learnability. Experiments employed a Transformer-based network trained to directly map noisy inputs to image pixels, achieving a “what-you-see-is-what-you-get” property absent in multi-step or latent-based methods.

The study incorporated a perceptual loss, combined with this formulation, to further refine generation quality. Researchers rigorously tested pMF on the ImageNet dataset, generating images at 256×256 resolution with a Fréchet Inception Distance (FID) of 2.22 and at 512×512 resolution with an FID of 2.48. To facilitate the decoupling, the team introduced a conversion relating the velocity fields, average velocity, and image data. This conversion lets the network output x-predictions while the loss is still computed in velocity space, a design shown to be critical: having the network directly predict a velocity field in pixel space resulted in catastrophic performance.
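To illustrate what a perceptual-style loss does, the toy below compares images at several resolutions through an image pyramid, so errors are penalised at multiple scales rather than pixel-by-pixel only. This is purely a stand-in: practical perceptual losses (such as LPIPS) compare deep-network feature activations, and the paper's exact loss is not reproduced here.

```python
import numpy as np

def downsample(img):
    """Average-pool a 2-D image by a factor of 2 in each dimension."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    img = img[:h, :w]
    return 0.25 * (img[0::2, 0::2] + img[1::2, 0::2]
                   + img[0::2, 1::2] + img[1::2, 1::2])

def multiscale_loss(a, b, levels=3):
    """Toy perceptual-style loss: mean-squared error summed over a
    pyramid of resolutions.  Real perceptual losses compare learned
    features; this only illustrates the multi-scale comparison idea."""
    total = 0.0
    for _ in range(levels):
        total += float(np.mean((a - b) ** 2))
        a, b = downsample(a), downsample(b)
    return total

rng = np.random.default_rng(1)
x = rng.normal(size=(16, 16))
assert multiscale_loss(x, x) == 0.0  # identical images cost nothing
```

Because pMF's output is already an image at every training step, such a loss can be applied directly, which is harder in velocity- or latent-space formulations.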

The experimental setup involved training the network with x-prediction outputs, leveraging the transformation to express them as average velocities when computing the loss, thereby bridging the image and velocity spaces. This method achieves competitive results, demonstrating that one-step latent-free generation is both feasible and effective. The work represents a solid step towards direct generative modelling, formulated as a single, end-to-end neural network, and builds upon foundations laid by Flow Matching, MeanFlow, and JiT, integrating their strengths into a cohesive framework. The approach enables a streamlined generation process that bypasses the need for iterative sampling or latent space manipulation, marking a significant advancement in the field of generative modelling.
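The one-step generation procedure itself reduces to a single network call: draw Gaussian noise at t = 1 and output the x-prediction at r = 0. A sketch with a hypothetical stand-in network (the real model is a Transformer; the shrink-toward-zero function here is only a placeholder):

```python
import numpy as np

rng = np.random.default_rng(0)

def net_x(z, r, t):
    """Stand-in for the trained x-prediction network (hypothetical):
    it simply shrinks the input toward zero so the sketch runs."""
    return 0.1 * z

def sample_one_step(shape):
    """One-step pMF-style sampling: start from pure noise at t = 1 and
    jump directly to the predicted image at r = 0.  A sketch of the
    procedure, not the paper's code."""
    z1 = rng.normal(size=shape)      # pure Gaussian noise at t = 1
    return net_x(z1, r=0.0, t=1.0)   # the x-prediction IS the image

img = sample_one_step((8, 8, 3))
print(img.shape)  # (8, 8, 3)
```

No latent decoder and no iterative refinement appear anywhere in the sampling path, which is the source of the method's efficiency claim.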

Pixel MeanFlow achieves low FID scores consistently

Scientists have developed pixel MeanFlow (pMF), a new approach to one-step latent-free image generation, achieving significant results on the ImageNet dataset. Experiments revealed strong performance at 256×256 resolution, with a Fréchet Inception Distance (FID) score of 2.22, and at 512×512 resolution, yielding an FID score of 2.48. The team measured image quality using the FID metric, demonstrating a key advancement in this generation regime. This breakthrough delivers a method that bypasses the need for multi-step sampling and latent spaces, traditionally core characteristics of modern flow-based image generation techniques.

Researchers formulated the network output and loss spaces separately, designing the network target to reside on a presumed low-dimensional image manifold, a scheme termed x-prediction. The loss function was defined using MeanFlow in velocity space, and a transformation was introduced to connect the image manifold with the average velocity field. Tests show that this formulation better aligns with the manifold hypothesis, creating a more learnable target for the network. Measurements confirm that directly predicting a velocity field in pixel space results in catastrophic performance, highlighting the importance of the chosen approach.
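The MeanFlow loss rests on an identity relating average and instantaneous velocities, u(z_t, r, t) = v(z_t, t) − (t − r)·du/dt, where d/dt is the total derivative along the ODE trajectory (this identity comes from the original MeanFlow work; its exact use inside pMF is our assumption). The sketch below verifies it numerically for a toy linear field whose average velocity has a closed form:

```python
import numpy as np

a = -1.3  # toy linear field v(z, t) = a*z (hypothetical example)

def v(z, t):
    """Instantaneous velocity of the toy ODE dz/ds = a*z."""
    return a * z

def u(z_t, r, t):
    """Closed-form average velocity for v = a*z: the trajectory through
    (z_t, t) is z_s = z_t * exp(a*(s - t)), so u = (z_t - z_r)/(t - r)."""
    z_r = z_t * np.exp(a * (r - t))
    return (z_t - z_r) / (t - r)

# Check the MeanFlow identity u = v - (t - r) * du/dt, with du/dt the
# total derivative along the trajectory, via a central finite difference.
z_t, r, t, h = 0.7, 0.2, 0.9, 1e-5
du_dt = (u(z_t + h * v(z_t, t), r, t + h)
         - u(z_t - h * v(z_t, t), r, t - h)) / (2 * h)
lhs = u(z_t, r, t)
rhs = v(z_t, t) - (t - r) * du_dt
assert abs(lhs - rhs) < 1e-6
```

In training, this identity supplies a regression target for the average velocity without ever integrating the ODE, which is what makes one-step sampling learnable.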

The results show that pMF learns a network capable of directly mapping noisy inputs to image pixels, enabling a “what-you-see-is-what-you-get” property absent in multi-step or latent-based methods. This property facilitated the natural incorporation of perceptual loss, further enhancing the quality of generated images. The study found that the use of perceptual loss contributes to improved image fidelity and visual appeal. The breakthrough delivers a competitive one-step latent-free generation method, marking a solid step towards direct generative modelling with a single, end-to-end neural network. Scientists achieved this by building upon recent advances in both few-/one-step sampling, such as Consistency Models and MeanFlow, and raw pixel space image generation, like “Just image Transformers”.

The work introduces a conversion relating velocity fields, average velocity, and the x-prediction, empirically demonstrating its alignment with the manifold hypothesis. Results demonstrate that pMF’s ability to simultaneously address manifold learning and trajectory modelling is critical for success in the pixel space. The researchers hope this work will further advance the boundaries of diffusion/flow-based generative models, offering a pathway to more efficient and direct image generation.

Pixel MeanFlow achieves latent-free image generation

Scientists have developed pixel MeanFlow (pMF), a new approach to image generation that achieves strong results without relying on multi-step sampling or latent spaces. Their core principle involves separating the network output and loss spaces, targeting the network towards a low-dimensional image manifold while defining the loss through MeanFlow in velocity space. This is facilitated by a transformation linking the image manifold to the average velocity field. Experiments demonstrate pMF’s effectiveness in one-step, latent-free generation on the ImageNet dataset, achieving a Fréchet Inception Distance (FID) of 2.22 at 256×256 resolution and 2.48 at 512×512 resolution.

The authors highlight that the computational overhead of latent decoders, often overlooked in previous studies, is significant; a standard decoder can exceed the cost of their entire generator. They acknowledge limitations including the aggressive patch size used, which prioritises computational efficiency but may affect certain aspects of image quality. This research suggests that neural networks possess considerable expressive power and, when appropriately designed, can learn complex end-to-end mappings directly from noise to pixels. The findings contribute to the advancement of flow-based generative models by demonstrating a viable path towards single-step, latent-free image generation. Future work could explore refinements to the transformation between image manifolds and velocity fields, as well as investigate the potential for applying this approach to other generative tasks.

👉 More information
🗞 One-step Latent-free Image Generation with Pixel Mean Flows
🧠 ArXiv: https://arxiv.org/abs/2601.22158

Rohail T.

I am a quantum scientist exploring the frontiers of physics and technology. My work focuses on uncovering how quantum mechanics, computing, and emerging technologies are transforming our understanding of reality. I share research-driven insights that make complex ideas in quantum science clear, engaging, and relevant to the modern world.
