A recent research release has attracted widespread attention: the CoMPaSS-FLUX.1 model. It is a LoRA adapter for the FLUX.1 text-to-image diffusion model, designed to significantly improve how well generated images respect the spatial relationships between objects. By handling specific object-to-object relationships more reliably, it opens up new possibilities for image generation.

CoMPaSS-FLUX.1 uses FLUX.1-dev as its base model. The adapter has a LoRA rank of 16, a file size of approximately 50MB, and is used through the Diffusers framework. Its main purpose is to generate images with accurate spatial relationships, including compositions that require specific spatial arrangements, improving spatial understanding while preserving the base model's other capabilities.
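Since the adapter is distributed for Diffusers, loading it might look like the minimal sketch below. The repository IDs are the public Hugging Face names; the assumptions are that the LoRA weight file can be discovered automatically by `load_lora_weights`, and that a CUDA GPU with enough memory for FLUX.1-dev is available. The helper names `load_compass_pipeline` and `generate` are illustrative, not part of any library.

```python
# Sketch only: attach the CoMPaSS LoRA adapter to FLUX.1-dev with Diffusers.
BASE_MODEL_ID = "black-forest-labs/FLUX.1-dev"
LORA_REPO_ID = "blurgy/CoMPaSS-FLUX.1"


def load_compass_pipeline(device: str = "cuda"):
    """Load the base pipeline and attach the LoRA weights (hypothetical helper)."""
    # Imported lazily so the constants above can be used without torch/diffusers.
    import torch
    from diffusers import FluxPipeline

    pipe = FluxPipeline.from_pretrained(BASE_MODEL_ID, torch_dtype=torch.bfloat16)
    # Assumption: the repo exposes a weight file Diffusers can find by itself.
    # If not, pass weight_name="<file>.safetensors" explicitly.
    pipe.load_lora_weights(LORA_REPO_ID)
    return pipe.to(device)


def generate(pipe, prompt: str, out_path: str) -> None:
    """Generate one image for `prompt` and save it to `out_path`."""
    image = pipe(prompt).images[0]
    image.save(out_path)
```

A typical call would be `generate(load_compass_pipeline(), "In the photo, a cat is to the right of a dog", "out.png")`. Note that FLUX.1-dev is a gated model, so downloading it requires accepting its license on Hugging Face first.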

The reported performance gains are substantial. On the VISOR benchmark, the relative improvement reached 98%; on the T2I-CompBench spatial test, 67%; and on the GenEval position evaluation, 131%. CoMPaSS-FLUX.1 also performs well on image fidelity: its FID and CMMD scores are lower than the base model's, and since lower is better for both metrics, this indicates improved generation quality.
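To make the percentages concrete, here is a small sketch of how a relative improvement is conventionally computed. It assumes the usual definition (change divided by the baseline score, for a higher-is-better metric); the numbers in the example are hypothetical and are not the benchmark scores.

```python
def relative_improvement(base: float, new: float) -> float:
    """Percent change of `new` relative to `base` (higher-is-better metric)."""
    if base == 0:
        raise ValueError("base score must be non-zero")
    return (new - base) / base * 100.0


def score_after(base: float, improvement_pct: float) -> float:
    """Score implied by a baseline and a relative improvement in percent."""
    return base * (1.0 + improvement_pct / 100.0)


# Hypothetical scores, NOT taken from VISOR, T2I-CompBench, or GenEval.
print(relative_improvement(0.5, 1.0))  # 100.0: the score doubled
print(score_after(0.2, 100.0))         # 0.4
```

Under this definition, a 131% relative improvement means the new score is 2.31 times the baseline, not 1.31 times.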

Users get the best results from prompts that state spatial relationships explicitly. The model performs best when the prompt uses clear spatial terms (such as "left," "right," "above," "below") or describes an unambiguous relationship between two different objects (for example, "In the photo, A is to the right of B").
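The prompt pattern above can be sketched as a small helper that always produces an explicit two-object relationship. The function name and the set of supported relations are illustrative choices, not something defined by the model.

```python
# Maps a short relation keyword to the phrase used in the prompt.
RELATION_PHRASES = {
    "left": "to the left of",
    "right": "to the right of",
    "above": "above",
    "below": "below",
}


def build_spatial_prompt(subject: str, relation: str, reference: str) -> str:
    """Build a prompt of the form 'In the photo, A is <relation> B'."""
    if relation not in RELATION_PHRASES:
        raise ValueError(f"unsupported relation: {relation!r}")
    # The guidance calls for two different objects, so reject identical ones.
    if subject.strip().lower() == reference.strip().lower():
        raise ValueError("subject and reference should be two different objects")
    return f"In the photo, {subject} is {RELATION_PHRASES[relation]} {reference}"


print(build_spatial_prompt("a cat", "right", "a dog"))
# In the photo, a cat is to the right of a dog
```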

For training, CoMPaSS-FLUX.1 used data from the SCOP (Spatial Constraint-Oriented Pairing) data engine, covering about 28,000 carefully selected object pairs. The pairs were filtered against strict criteria for visual importance, semantic distinction, spatial clarity, object relationships, and visual balance.

Training ran for 24,000 steps with a batch size of 4 and a learning rate of 1e-4, using the AdamW optimizer with a weight decay of 1e-2.
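These hyperparameters can be collected into a small configuration object, as in the sketch below. The class and function names are hypothetical, and the sample count assumes that the batch size of 4 is the effective batch size (no gradient accumulation or multi-device scaling), which the source does not state. The optimizer helper uses PyTorch's `torch.optim.AdamW` and only needs PyTorch when it is actually called.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Reported training settings (hypothetical container)."""
    max_steps: int = 24_000
    batch_size: int = 4
    learning_rate: float = 1e-4
    weight_decay: float = 1e-2
    lora_rank: int = 16

    @property
    def samples_seen(self) -> int:
        # Assumes batch_size is the effective batch size per optimizer step.
        return self.max_steps * self.batch_size


def build_optimizer(trainable_params, cfg: TrainConfig):
    """Create AdamW over the LoRA parameters using the settings above."""
    import torch  # imported lazily; only required when building the optimizer

    return torch.optim.AdamW(
        trainable_params, lr=cfg.learning_rate, weight_decay=cfg.weight_decay
    )


cfg = TrainConfig()
print(cfg.samples_seen)  # 96000
```

Under that assumption, the run processes 96,000 training samples in total.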

Hugging Face: https://huggingface.co/blurgy/CoMPaSS-FLUX.1

Key Points:

🌟 The CoMPaSS-FLUX.1 model significantly improves spatial understanding in text-to-image generation, especially for relationships between objects.

📊 Performance evaluations show clear improvements on multiple benchmarks while maintaining high-quality generation results.

📚 The model was trained on a strictly filtered dataset, which helps the generated images show accurate and visually clear spatial relationships.