What question did this study set out to answer?

This research aims to develop a robust framework for multi-exposure image fusion that addresses limitations of current methods due to the lack of real-world ground truth.

March 26, 2026

Diff-MEF: Cross-modal Diffusion Framework with Text Prompts and Semantic Perception for Multi-exposure Image Fusion

Key Points

Developed a cross-modal diffusion framework named Diff-MEF for image fusion.
Utilized a probabilistic estimation approach and conditional diffusion model for fusion tasks.
Incorporated text prompts and semantic perception to enhance performance.
Implemented multi-modal prior embedding with dedicated encoders for feature refinement.
Applied a semantic-level contrastive loss to regularize cross-modal features.
Diff-MEF outperformed state-of-the-art methods in various exposure scenarios.
Achieved superior robustness compared to methods relying on pseudo ground truth.
Demonstrated enhanced fusion quality through effective integration of semantic insights.
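
To make the "conditional diffusion model" point concrete, here is a minimal sketch of the forward noising step from a standard DDPM, together with one common way of conditioning a denoiser on the source exposures (channel-wise concatenation). This is an illustrative assumption, not the authors' implementation: the function names, the linear noise schedule, and the concatenation scheme are all hypothetical, and the summary does not state which schedule or conditioning mechanism Diff-MEF uses.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear schedule (assumed, not from the paper)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def forward_noise(x0, t, alpha_bar, rng):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I)."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

def build_denoiser_input(x_t, under, over):
    """Hypothetical conditioning: stack the noisy estimate with both exposures along channels."""
    return np.concatenate([x_t, under, over], axis=0)

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()

# Toy 3-channel 8x8 images in channel-first layout, standing in for real data.
fused_target = rng.random((3, 8, 8))   # placeholder for the fusion target
under_exposed = rng.random((3, 8, 8))
over_exposed = rng.random((3, 8, 8))

x_t, eps = forward_noise(fused_target, t=500, alpha_bar=alpha_bar, rng=rng)
denoiser_input = build_denoiser_input(x_t, under_exposed, over_exposed)
print(denoiser_input.shape)
```

A denoising network would take `denoiser_input` and the timestep and be trained to predict `eps`; at inference, the reverse process starts from noise and progressively moves toward a fused image while the two exposures stay fixed as conditions.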

Abstract

The absence of real-world ground truth (GT) remains a challenge in multi-exposure image fusion (MEF). Benchmarks synthesize pseudo GT through algorithm ensembles. Existing methods, hampered by the inherent imperfections of pseudo GT and by fixed mapping relationships, show limited performance and robustness. To address these limitations, we propose a novel cross-modal diffusion framework, termed Diff-MEF, that synergizes text prompts and semantic perception for MEF. First, it reformulates MEF as a probabilistic estimation task, using a conditional diffusion model for progressive transition and fusion. Then, we explicitly infer semantic and exposure priors, in the form of text prompts and semantic perception, to improve performance and robustness. The priors are synergized through multi-modal prior embedding and optimization guidance. On the one hand, for cross-modal interaction, multi-modal priors, including segmentation masks and exposure- and content-aware text prompts, are embedded into the diffusion process by dedicated encoders and refine visual features through a text-segmentation refinement module. On the other hand, a semantic-level contrastive loss regularizes cross-modal features in the semantic space of CLIP to mitigate degradations introduced by pseudo GT and fusion distortions. Experiments demonstrate that Diff-MEF outperforms SOTA methods and pseudo GT, with superior fusion performance and robustness across diverse exposure scenarios. Code is available at https://github.com/hanna-xu/Diff-MEF.
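
The semantic-level contrastive loss is described only at a high level here. As a rough illustration of how a contrastive regularizer in an embedding space can work, the sketch below implements a generic InfoNCE-style loss over plain vectors. It is an assumption-laden stand-in, not the loss defined in the paper: the vectors play the role of CLIP embeddings (no CLIP model is loaded), and the choice of anchor, positive, negatives, and the temperature value are hypothetical.

```python
import numpy as np

def l2_normalize(v, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length so dot products act as cosine similarity."""
    return v / (np.linalg.norm(v, axis=axis, keepdims=True) + eps)

def contrastive_loss(anchor, positive, negatives, temperature=0.07):
    """Generic InfoNCE-style loss: pull the anchor toward the positive, push it from the negatives.

    anchor:    (d,)   e.g. embedding of the fused image (hypothetical role)
    positive:  (d,)   e.g. embedding of a "well-exposed" text prompt (hypothetical role)
    negatives: (k, d) e.g. embeddings of degraded or badly exposed references (hypothetical role)
    """
    a = l2_normalize(anchor)
    p = l2_normalize(positive)
    n = l2_normalize(negatives)

    pos_logit = np.dot(a, p) / temperature
    neg_logits = (n @ a) / temperature
    logits = np.concatenate([[pos_logit], neg_logits])

    # Subtract the maximum before exponentiating for numerical stability.
    logits = logits - logits.max()
    log_prob_pos = logits[0] - np.log(np.exp(logits).sum())
    return -log_prob_pos

# Toy 3-dimensional "embeddings" standing in for CLIP features.
anchor = np.array([1.0, 0.0, 0.0])
aligned = contrastive_loss(anchor, np.array([1.0, 0.0, 0.0]),
                           np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]))
misaligned = contrastive_loss(anchor, np.array([0.0, 1.0, 0.0]),
                              np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]))
print(aligned < misaligned)
```

The loss is small when the anchor sits close to the positive and far from the negatives, and large in the opposite case, which is the behaviour a regularizer needs in order to steer fused results away from degraded references in the semantic space.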
