Text-to-image (T2I) diffusion models generate high-quality images but often fail to capture the spatial relations specified in text prompts. This limitation can be traced to two factors: a lack of fine-grained spatial supervision in the training data and the inability of text embeddings to encode spatial semantics. We introduce InfSplign, a training-free inference-time method that improves spatial alignment by adjusting the noise through a compound loss at every denoising step. The proposed loss leverages cross-attention maps at different levels, extracted from the backbone decoder, to enforce accurate object placement and balanced object presence during sampling. The method is lightweight, plug-and-play, and compatible with any diffusion backbone. Our comprehensive evaluations on VISOR and T2I-CompBench show that InfSplign establishes a new state of the art (to the best of our knowledge), achieving substantial performance gains over the strongest existing inference-time baselines and even outperforming fine-tuning-based methods. The codebase is available on GitHub.
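The abstract does not spell out the compound loss, so the following is only a hypothetical sketch of the general idea: score object placement and object presence from two per-object cross-attention maps. The function names, the hinge form, the `margin` and `weight` parameters, and the use of attention centroids are all assumptions made for illustration, not InfSplign's actual formulation.

```python
from typing import List

# A cross-attention map for one object token: rows x cols of non-negative weights.
AttentionMap = List[List[float]]


def total_mass(attn: AttentionMap) -> float:
    """Sum of all attention weights in the map."""
    return sum(sum(row) for row in attn)


def centroid_x(attn: AttentionMap) -> float:
    """Attention-weighted mean column index, normalized to [0, 1]."""
    mass = total_mass(attn)
    if mass == 0:
        raise ValueError("attention map has no mass")
    width = len(attn[0])
    weighted = sum(w * col for row in attn for col, w in enumerate(row))
    return weighted / mass / max(width - 1, 1)


def left_of_loss(attn_a: AttentionMap, attn_b: AttentionMap, margin: float = 0.1) -> float:
    """Hypothetical placement term for the relation "A to the left of B".

    Zero once A's centroid lies at least `margin` to the left of B's.
    """
    return max(0.0, margin + centroid_x(attn_a) - centroid_x(attn_b))


def presence_balance_loss(attn_a: AttentionMap, attn_b: AttentionMap) -> float:
    """Hypothetical presence term: penalizes one object dominating the attention."""
    mass_a, mass_b = total_mass(attn_a), total_mass(attn_b)
    return abs(mass_a - mass_b) / (mass_a + mass_b)


def compound_loss(attn_a: AttentionMap, attn_b: AttentionMap, weight: float = 1.0) -> float:
    """Placement term plus a weighted presence term (assumed combination)."""
    return left_of_loss(attn_a, attn_b) + weight * presence_balance_loss(attn_a, attn_b)
```

In an inference-time guidance setting, a loss of this kind would be differentiated with respect to the noisy latent at each denoising step and used to nudge the predicted noise. That step needs an autodiff framework and a diffusion backbone, so it is omitted here.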
Sarah Rastegar
Violeta Chatalbasheva
Sieger Falkena
Rastegar et al. studied this question.
www.synapsesocial.com/papers/698586388f7c464f2300a2a2 — DOI: https://doi.org/10.13016/m2bdjk-f37j