March 3, 2026 | Open Access

Photorealistic fire scene video generation via multimodal large language model and pre-trained video diffusion model

Key Points

T2VFire improves the physical consistency and visual realism of generated fire scene videos by combining a multimodal large language model with a pretrained video diffusion model.
Experimental results indicate that T2VFire outperforms existing video generation models in producing realistic fire scenes.
Utilizing a retrieval-augmented generation mechanism, T2VFire expands text prompts for better keyframe production.
This work lays the groundwork for future applications in smart firefighting and fire safety management.

Abstract

Text-to-video diffusion models have made significant progress. However, there is still a lack of dedicated research on generating fire scene videos with physical realism and visual fidelity. To address this gap, we propose text-to-video fire (T2VFire) scene generation. T2VFire uses GPT-4o as the core engine, which is integrated with an external fire-related knowledge base and a retrieval-augmented generation (RAG) mechanism that can be dynamically updated based on prompts. With the support of this knowledge, the system first expands the user's initial text description and generates a keyframe image. Then, through iterative prompt optimization, it guides a pretrained video diffusion model to generate fire scene videos with physical consistency. Experimental results show that T2VFire improves upon the physical consistency and visual realism of fire scene videos generated by current video generation models. This method provides a solid foundation for future smart firefighting and digital twin systems in building fire safety management.
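
The prompt-expansion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the knowledge snippets, the function names (`tokenize`, `retrieve`, `expand_prompt`), and the keyword-overlap scoring are all assumptions made for illustration, and no GPT-4o or diffusion model call is involved. The sketch only shows the general retrieval-augmented pattern of looking up relevant fire knowledge and appending it to the user's text before it would be passed to a generator.

```python
import re

# Hypothetical knowledge snippets, invented for illustration only.
# They are not taken from the paper's knowledge base.
KNOWLEDGE_BASE = [
    "Smoke from a compartment fire rises and accumulates under the ceiling.",
    "Flames spread upward along vertical surfaces faster than downward.",
    "Burning wood produces yellow-orange flames and grey smoke.",
]


def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(prompt, knowledge, top_k=2):
    """Return up to top_k snippets ranked by word overlap with the prompt.

    Word overlap is a stand-in for a real retriever (an assumption);
    snippets with no shared words are dropped.
    """
    query = tokenize(prompt)
    scored = [(len(query & tokenize(s)), i, s) for i, s in enumerate(knowledge)]
    scored = [item for item in scored if item[0] > 0]
    # Highest overlap first; ties keep the original knowledge-base order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [snippet for _, _, snippet in scored[:top_k]]


def expand_prompt(prompt, knowledge, top_k=2):
    """Append retrieved snippets to the prompt; leave it unchanged if none match."""
    facts = retrieve(prompt, knowledge, top_k)
    if not facts:
        return prompt
    return prompt + " Context: " + " ".join(facts)


if __name__ == "__main__":
    print(expand_prompt("wood fire in a room", KNOWLEDGE_BASE))
```

In a full pipeline of the kind the abstract outlines, the expanded text would then drive keyframe generation and, after iterative prompt refinement, the video diffusion model; those stages are omitted here because the page gives no detail about them.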

Cite This Study

Zheng et al. studied this question.

synapsesocial.com/papers/69a75adcc6e9836116a213a7 https://doi.org/10.26599/cvm.2025.9450511