Folks, you have to hear about an exciting new piece of research today: Flow-GRPO! This is no ordinary result; it's like strapping a "super evolution booster" onto image generation models, propelling them from "bronze" rank all the way up to "king." Want to know how it pulls that off? Grab a seat and let me walk you through it!

Image Generation Models' "Growing Pains"

Current image generation models, including those based on flow matching, rest on solid theoretical foundations and produce impressively high-quality images. But they have their own "growing pains," especially in complex scenes that require arranging multiple objects, juggling attributes and spatial relationships, or rendering text accurately inside the image.


Paper link: https://www.arxiv.org/pdf/2505.05470
Project link: https://github.com/yifan123/flow_grpo

Meanwhile, online reinforcement learning (RL) has proven remarkably effective at boosting the reasoning ability of large language models. In image generation, however, RL has mostly been applied to earlier diffusion models or through offline techniques such as direct preference optimization (DPO); almost no one had explored whether online RL could bring a similar breakthrough to flow-matching generative models. It's like holding a powerful key without realizing it might open a new door. Now, Flow-GRPO is here to "unlock" that door!

Training flow models with RL is no easy task, though. First, a flow model's generation process is like a train on a fixed track: it follows a deterministic ordinary differential equation (ODE), marching through a fixed step-by-step procedure at inference time with no random sampling at all. RL, on the other hand, is like a curious child: it relies on randomly exploring different actions and then learning from the feedback. One demands strict determinism, the other demands exploration, so how can the two possibly work together?
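
To see the mismatch concretely, here is the flow-matching sampler written as an equation. The notation (a generic velocity field v_theta) is mine, consistent with the usual flow-matching setup rather than copied from the paper:

```latex
% Flow-matching generation follows a deterministic probability-flow ODE,
%   dx_t = v_\theta(x_t, t)\, dt,
% which in practice is discretized into fixed Euler steps:
x_{t+\Delta t} \;=\; x_t \;+\; v_\theta(x_t, t)\,\Delta t
```

Every step is fully determined by the current state, so there is no action distribution to explore and no log-probability to feed into a policy-gradient update, which is precisely what RL needs.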

Second, efficient sampling is essential for online RL training, yet generating each sample with a flow model takes many iterative denoising steps, making sampling slow as a snail. The bigger and more complex the model, the worse this gets, a real "snowball effect." Improving sampling efficiency is therefore crucial if RL is to make an impact on image or video generation.


Flow-GRPO to the Rescue!

To tackle these challenges, Flow-GRPO shines brightly! It's like a super "magic toolbox" containing two incredible "magic" strategies.

The first magic is the "ODE-to-SDE conversion." Imagine converting a train that can only run on fixed tracks into a car that can drive on any road. Flow-GRPO transforms the original deterministic ODE into an equivalent stochastic differential equation (SDE) that matches the original model's marginal distribution at every timestep. This injects randomness into the generation process, giving the model exactly the room to explore that RL requires. Before, the model generated images by walking straight down a single path; now it can wander off and try different routes, discovering better ways to generate images. Isn't that neat?
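
Here is the general shape of that marginal-preserving trick, written in standard score-based-SDE notation; treat it as a sketch of the idea rather than the paper's exact derivation:

```latex
% Starting from the deterministic sampler dx_t = v_\theta(x_t, t) dt,
% inject noise at any level \sigma_t and compensate with a score-based
% drift correction; a Fokker--Planck argument shows that every marginal
% p_t(x) stays exactly the same:
dx_t \;=\; \Big[\, v_\theta(x_t, t) \;+\; \tfrac{\sigma_t^2}{2}\,\nabla_x \log p_t(x_t) \Big]\, dt \;+\; \sigma_t\, dw_t
```

For the Gaussian probability paths used in flow matching, the score term can be expressed through the learned velocity, so the same network drives both samplers. And because each SDE step is now a Gaussian transition, it comes with an explicit log-probability, exactly the ingredient a policy-gradient method like GRPO needs.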

The second magic is the "denoising reduction strategy." During training, Flow-GRPO acts like a savvy "time manager," cutting down the number of denoising steps so training data can be collected quickly. At inference time, it switches back to the full denoising schedule to guarantee high-quality samples. Think of it like training for a race: in practice you run short, fast intervals to get lots of repetitions in, and on race day you run the full distance to deliver your best performance.
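
Here is a minimal, self-contained PyTorch sketch of what this train-fast / infer-slow split can look like. Everything below (the simplified SDE step, the toy velocity field, the tensor shapes) is illustrative and assumed; it is not code from the official flow_grpo repository:

```python
import math
import torch

def sde_step(velocity, x, t, dt, noise_level):
    """One Euler-Maruyama denoising step of a stochastic sampler.

    `velocity` stands in for the learned flow model v_theta(x, t); the
    drift and the sigma schedule are simplified placeholders, not the
    paper's exact formulas.
    """
    sigma = noise_level * math.sqrt(dt)
    mean = x + velocity(x, t) * dt                 # deterministic drift
    x_next = mean + sigma * torch.randn_like(x)    # stochastic exploration
    # Gaussian log-probability of the sampled transition; GRPO uses
    # these per-step log-probs in its policy-gradient update.
    log_prob = (-0.5 * ((x_next - mean) / sigma) ** 2).sum() \
               - x.numel() * math.log(sigma * math.sqrt(2 * math.pi))
    return x_next, log_prob

def rollout(velocity, shape, num_steps, noise_level=0.7):
    """Sample a full denoising trajectory with `num_steps` steps."""
    x = torch.randn(shape)                         # start from pure noise
    dt = 1.0 / num_steps
    log_probs = []
    for i in range(num_steps):
        t = 1.0 - i * dt
        x, lp = sde_step(velocity, x, t, dt, noise_level)
        log_probs.append(lp)
    return x, torch.stack(log_probs)

# Toy stand-in for the learned velocity field.
toy_velocity = lambda x, t: -x

# Training-time rollouts: few steps, cheap data collection for RL.
fast_sample, logps = rollout(toy_velocity, (1, 4, 8, 8), num_steps=10)
# Inference: restore the full schedule for the best sample quality.
full_sample, _ = rollout(toy_velocity, (1, 4, 8, 8), num_steps=40)
```

The only thing that changes between the two calls is `num_steps`, which is why the training-time speedup comes essentially for free.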


How Does Flow-GRPO Perform in Practice?

Just how powerful is Flow-GRPO? Researchers tested it on various text-to-image (T2I) tasks, and the results were stunning!

First up is compositional image generation, evaluated on the GenEval benchmark. This task isn't simple: it demands precise object arrangement and attribute control, like building with Lego, where every brick has to go exactly where it belongs. With Flow-GRPO, the GenEval accuracy of Stable Diffusion 3.5 Medium (SD3.5-M) soared from 63% to 95%, even surpassing GPT-4o! Previously, generated images might have the wrong number of objects, mixed-up colors, or awkward placements; with Flow-GRPO, those issues are largely gone, and the results look as if someone had magically arranged everything in exactly the right place.

In visual text rendering, Flow-GRPO lifted SD3.5-M's accuracy from 59% to 92%. Previously, the model might render text crookedly or with parts of it missing; now the text appears in the image accurately and cleanly, like a perfectly written caption for the visuals. That is far more than a marginal improvement.

In human preference alignment, Flow-GRPO also performed excellently. Using PickScore as the reward model, it made generated images align much more closely with human preferences. Better yet, it showed almost no reward hacking while doing so. What is reward hacking? It's when a model sacrifices image quality and diversity just to drive up its reward score, churning out blurry or repetitive images. Flow-GRPO is different: like an upright "enforcer," it keeps image quality and diversity intact while still pushing the reward score up dramatically.

The researchers also analyzed Flow-GRPO extensively. For example, when tackling reward hacking, they first tried combining several reward models, but that still led to local blurriness and reduced diversity, like draping a layer of fog over a beautiful landscape photo. Switching to a KL constraint worked much better: after tuning the KL coefficient, the model could optimize the task-specific reward without hurting overall performance, hitting the perfect "balance point."
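
To make the KL idea concrete, here is a minimal sketch of a GRPO-style loss with a KL penalty toward the frozen reference model. The function name, the clipping, the KL estimator, and the default coefficient are all my assumptions for illustration, not the repository's actual implementation:

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, clip_eps=0.2, kl_coef=0.04):
    """Group-relative policy loss with a KL penalty (illustrative sketch).

    logp_new / logp_old / logp_ref: summed trajectory log-probs under the
    current, behavior, and frozen reference policies, shape (group_size,).
    rewards: one scalar reward per sampled image, shape (group_size,).
    """
    # Group-relative advantage: standardize rewards within the group of
    # images sampled for the same prompt (no value network needed).
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Clipped, importance-weighted policy-gradient term (PPO-style).
    ratio = torch.exp(logp_new - logp_old)
    pg = -torch.min(ratio * adv,
                    ratio.clamp(1 - clip_eps, 1 + clip_eps) * adv)

    # KL penalty toward the original model; this is the term that curbs
    # reward hacking (blurriness, collapsed diversity).
    kl = logp_new - logp_ref
    return (pg + kl_coef * kl).mean()

# Example with a group of 8 images sampled for one prompt.
g = 8
logp_new = torch.randn(g, requires_grad=True)      # stands in for the model's output
loss = grpo_loss(logp_new, torch.randn(g), torch.randn(g), torch.rand(g))
loss.backward()
```

Turning `kl_coef` up pulls the policy back toward the base model (less hacking, smaller reward gains); turning it down does the opposite, which is exactly the balance the authors tune.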

There was also an analysis of the denoising reduction strategy. By reducing the number of timesteps during training from 40 to 10, the training speed improved by over four times, with no negative impact on the final reward score. It's like driving; previously, you had to go slowly to reach the destination, but now, with a smoother route, you get there faster without compromising anything!

Noise level matters too. A moderate amount of noise in the SDE boosts image diversity and exploration, which greatly helps RL training; too much noise, though, degrades image quality, like splashing ink across a fine painting. The researchers found that a noise level of around 0.7 strikes the best balance, preserving image quality while still leaving room to explore different possibilities.

Flow-GRPO also generalizes well. On unseen test scenes, it accurately captures object counts, colors, and spatial relationships, and it copes with unfamiliar object categories without breaking a sweat. Trained on prompts with 2-4 objects, it handles 5-6 objects at test time with ease, like a star student who learns by analogy and can tackle any question thrown at it!

Future Prospects and Challenges

Although Flow-GRPO has performed exceptionally well in text-to-image tasks, researchers aren't stopping here. They're already looking towards broader applications — video generation. However, this brings new challenges.

First, reward design. In video generation, simple reward models won't cut it; more sophisticated and reliable reward models are needed to judge whether a video is realistic and smooth. It's like grading a movie: you can't look at the visuals alone, you also have to weigh the plot, the sound, and more.

Second, balancing multiple rewards. Video generation has to optimize several objectives at once, such as realism, smoothness, and temporal coherence. These objectives can behave like unruly kids who never agree, which makes balancing them hard. The researchers need to find ways to make them "coexist harmoniously," and that's no small feat.

Finally, scalability. Video generation is a far bigger "eater" of compute than image generation. To apply Flow-GRPO to video, more efficient data collection and training methods are needed; otherwise the available resources simply won't keep up with its appetite.

But these challenges won't stop Flow-GRPO from advancing. With researchers' efforts, Flow-GRPO will continue to shine in the image generation field and potentially create miracles in video generation and other areas, bringing us more surprises! Perhaps in the future, the visuals in the movies we watch and the games we play will be generated by Flow-GRPO. Let's wait and see!