Text-to-video diffusion models have made significant progress, yet little research has focused on generating fire scene videos that are both physically realistic and visually faithful. To address this gap, we propose T2VFire, a text-to-video framework for fire scene generation. T2VFire uses GPT-4o as its core engine, integrated with an external fire-related knowledge base and a retrieval-augmented generation (RAG) mechanism that can be dynamically updated based on prompts. Drawing on this knowledge, the system first expands the user's initial text description and generates a keyframe image. Then, through iterative prompt optimization, it guides a pretrained video diffusion model to generate physically consistent fire scene videos. Experimental results show that T2VFire improves the physical consistency and visual realism of fire scene videos relative to those generated by current video generation models. The method provides a foundation for future smart firefighting and digital twin systems in building fire safety management.
Zheng et al. studied this question.
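
The control flow summarized in the abstract (retrieve fire knowledge, expand the prompt, generate a keyframe, then iteratively refine the prompt for the video model) can be illustrated with a minimal Python sketch. It is not the authors' implementation: every name here (`KNOWLEDGE_BASE`, `retrieve`, `expand_prompt`, `run_pipeline`, and the `toy_*` stand-ins) is hypothetical, the word-overlap retriever is a toy substitute for a real RAG retriever, and the GPT-4o and video diffusion calls are replaced by plain callables so the example runs on its own.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for the external fire-related knowledge base.
KNOWLEDGE_BASE: List[str] = [
    "Smoke from a compartment fire rises and forms a hot layer under the ceiling.",
    "Flames spread upward faster than they spread sideways.",
    "Flashover occurs when all exposed combustible surfaces ignite nearly at once.",
]


def tokenize(text: str) -> set:
    """Lowercase the text and split it into a set of words without punctuation."""
    return {w.strip(".,").lower() for w in text.split()}


def retrieve(prompt: str, kb: List[str], k: int = 2) -> List[str]:
    """Toy retriever: rank knowledge entries by word overlap with the prompt."""
    query = tokenize(prompt)
    ranked = sorted(kb, key=lambda entry: len(query & tokenize(entry)), reverse=True)
    return ranked[:k]


def expand_prompt(prompt: str, facts: List[str]) -> str:
    """Append retrieved facts to the user prompt (assumed expansion format)."""
    if not facts:
        return prompt
    return prompt + " Physical constraints: " + " ".join(facts)


def run_pipeline(
    user_prompt: str,
    kb: List[str],
    make_keyframe: Callable[[str], str],
    make_video: Callable[[str, str], dict],
    critique: Callable[[dict], str],
    max_rounds: int = 3,
) -> Tuple[str, dict]:
    """Expand the prompt, build a keyframe, then refine the prompt iteratively."""
    prompt = expand_prompt(user_prompt, retrieve(user_prompt, kb))
    keyframe = make_keyframe(prompt)
    video: dict = {}
    for _ in range(max_rounds):
        video = make_video(prompt, keyframe)
        feedback = critique(video)  # empty string means "acceptable"
        if not feedback:
            break
        prompt = prompt + " " + feedback  # fold the critique into the next prompt
    return prompt, video


# Toy stand-ins for the image model, video model, and critic (all hypothetical).
def toy_keyframe(prompt: str) -> str:
    return f"<keyframe for: {prompt[:40]}>"


def toy_video(prompt: str, keyframe: str) -> dict:
    return {"prompt": prompt, "keyframe": keyframe}


def toy_critique(video: dict) -> str:
    if "smoke" in video["prompt"].lower():
        return ""
    return "Show smoke rising toward the ceiling."


if __name__ == "__main__":
    final_prompt, video = run_pipeline(
        "A kitchen fire at night", KNOWLEDGE_BASE, toy_keyframe, toy_video, toy_critique
    )
    print(final_prompt)
```

Passing the models in as callables keeps the loop independent of any particular API; in a real system the critic would be a model-based judgment of physical consistency rather than a keyword check.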