The absence of real-world ground truth (GT) remains a challenge in multi-exposure image fusion (MEF). Benchmarks therefore synthesize pseudo GT through algorithm ensembles. Existing methods, hampered by the inherent imperfections of pseudo GT and by fixed mapping relationships, show limited performance and robustness. To address these limitations, we propose a novel cross-modal diffusion framework, termed Diff-MEF, that synergizes text prompts and semantic perception for MEF. First, it reformulates MEF as a probabilistic estimation task, using a conditional diffusion model for progressive transition and fusion. Then, we explicitly infer semantic and exposure priors, in the form of text prompts and semantic perception, to improve performance and robustness. The priors are synergized through multi-modal prior embedding and optimization guidance. On the one hand, for cross-modal interaction, multi-modal priors, including segmentation masks and exposure- and content-aware text prompts, are embedded into the diffusion process by dedicated encoders and refine visual features through a text-segmentation refinement module. On the other hand, a semantic-level contrastive loss imposes a regularization between cross-modal features in the semantic space of CLIP to mitigate degradations introduced by pseudo GT and fusion distortions. Experiments demonstrate that Diff-MEF outperforms SOTA methods and pseudo GT, with superior fusion performance and robustness across diverse exposure scenarios. Code is available at https://github.com/hanna-xu/Diff-MEF.
Xu et al. studied this question.
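To make the idea of a semantic-level contrastive loss more concrete, the following is a minimal sketch only. The abstract does not specify the form of the loss, so an InfoNCE-style objective is assumed here. The function names (`l2_normalize`, `contrastive_loss`), the temperature value, and the choice of positives and negatives are all hypothetical, and plain NumPy vectors stand in for CLIP embeddings. It is not the Diff-MEF implementation; the actual formulation is in the released code.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def contrastive_loss(anchor, positive, negatives, temperature=0.07):
    """InfoNCE-style loss in an embedding space (assumed form, not the paper's).

    anchor:    (d,)   embedding of the fused image (hypothetical role)
    positive:  (d,)   embedding to pull the anchor toward, e.g. a text prompt
    negatives: (n, d) embeddings to push the anchor away from, e.g. degraded images
    """
    a = l2_normalize(anchor)
    p = l2_normalize(positive)
    n = l2_normalize(negatives)

    # Cosine similarities scaled by the temperature.
    pos_logit = float(np.dot(a, p)) / temperature
    neg_logits = (n @ a) / temperature
    logits = np.concatenate([[pos_logit], neg_logits])

    # Numerically stable log-sum-exp over the positive and all negatives.
    m = logits.max()
    log_denominator = m + np.log(np.exp(logits - m).sum())

    # Negative log-probability of picking the positive among all candidates.
    return float(log_denominator - pos_logit)
```

Under these assumptions, an anchor whose embedding lies close to the positive yields a loss near zero, while an anchor close to one of the negatives is penalized heavily. In practice the embeddings would come from a pretrained CLIP image and text encoder rather than from arrays constructed by hand.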