February 6, 2026 | Open Access

InfSplign: Inference-Time Spatial Alignment of Text-to-Image Diffusion Models

Key Points

This work aims to improve the spatial alignment of images generated by text-to-image diffusion models without retraining.
InfSplign operates at inference time and requires no training.
A compound loss adjusts the noise at every denoising step.
The loss uses cross-attention maps from the backbone decoder to enforce accurate object placement and balanced object presence.
The authors report state-of-the-art performance on the VISOR and T2I-CompBench benchmarks.
InfSplign outperforms the strongest existing inference-time baselines and also the fine-tuning-based methods.

Abstract

Text-to-image (T2I) diffusion models generate high-quality images but often fail to capture the spatial relations specified in text prompts. This limitation can be traced to two factors: a lack of fine-grained spatial supervision in the training data and the inability of text embeddings to encode spatial semantics. We introduce InfSplign, a training-free inference-time method that improves spatial alignment by adjusting the noise through a compound loss at every denoising step. The proposed loss leverages different levels of cross-attention maps extracted from the backbone decoder to enforce accurate object placement and balanced object presence during sampling. The method is lightweight, plug-and-play, and compatible with any diffusion backbone. Our comprehensive evaluations on VISOR and T2I-CompBench show that InfSplign establishes a new state of the art (to the best of our knowledge), achieving substantial performance gains over the strongest existing inference-time baselines and even outperforming the fine-tuning-based methods. The codebase is available on GitHub.
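
To make the idea of a compound loss over cross-attention maps more concrete, here is a minimal sketch in plain Python. It is not the InfSplign loss, which the abstract does not define. The centroid-based placement term, the mass-based presence term, the weights, the margin, and all function names are assumptions chosen for illustration. In an inference-time guidance method, the gradient of such a loss with respect to the latent would typically be used to adjust the predicted noise at each denoising step; this sketch only evaluates the loss on toy attention maps.

```python
# Illustrative sketch only: every loss term here is an assumption,
# not the definition used by InfSplign.

def centroid(attn):
    """Attention-weighted (x, y) centroid of a 2D map, in [0, 1] coordinates."""
    h, w = len(attn), len(attn[0])
    total = sum(sum(row) for row in attn)
    if total == 0:
        return 0.5, 0.5  # no attention mass: fall back to the image centre
    cx = sum(v * (x + 0.5) / w for row in attn for x, v in enumerate(row)) / total
    cy = sum(v * (y + 0.5) / h for y, row in enumerate(attn) for v in row) / total
    return cx, cy


def placement_loss(attn_a, attn_b, relation, margin=0.25):
    """Hinge penalty: zero once A is at least `margin` from B in the stated direction."""
    ax, ay = centroid(attn_a)
    bx, by = centroid(attn_b)
    # Signed offset that should reach `margin`; y grows downward in image space.
    offsets = {
        "left of": bx - ax,
        "right of": ax - bx,
        "above": by - ay,
        "below": ay - by,
    }
    return max(0.0, margin - offsets[relation])


def presence_loss(attn_a, attn_b):
    """Penalize unequal total attention mass so neither object dominates."""
    mass_a = sum(sum(row) for row in attn_a)
    mass_b = sum(sum(row) for row in attn_b)
    return abs(mass_a - mass_b) / max(mass_a + mass_b, 1e-8)


def compound_loss(attn_a, attn_b, relation, w_place=1.0, w_presence=0.5):
    """Weighted sum of the hypothetical placement and presence terms."""
    return (w_place * placement_loss(attn_a, attn_b, relation)
            + w_presence * presence_loss(attn_a, attn_b))


# Toy 2x4 attention maps for the prompt "a cat to the left of a dog".
cat_map = [[1, 0, 0, 0],
           [1, 0, 0, 0]]   # mass concentrated on the left edge
dog_map = [[0, 0, 0, 1],
           [0, 0, 0, 1]]   # mass concentrated on the right edge

satisfied = compound_loss(cat_map, dog_map, "left of")  # layout matches the prompt
violated = compound_loss(dog_map, cat_map, "left of")   # objects are swapped
```

With these toy maps the loss is zero when the layout already matches the prompt and positive when the two objects are swapped, which is the kind of signal a guidance step could follow.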

Authors

Sarah Rastegar

Violeta Chatalbasheva

Sieger Falkena
