This research investigates the ability of large language models to reconstruct disaster scenarios using multimodal data from Hurricane Harvey. By analyzing well-annotated tweet-image pairs that describe disaster impacts, the study evaluates five state-of-the-art models across two key tasks: spatiotemporal storm tracking and humanitarian impact summarization. Results show that models such as GPT-4o and DeepSeek-R1 demonstrate strong reasoning capabilities, effectively aligning textual and visual evidence to infer daily disaster conditions. However, challenges remain in interpreting implicit humanitarian cues, such as the need for emotional support or donations. The study also proposes a rubric-based evaluation framework to assess transparency, groundedness, and narrative coherence. The findings underscore both the promise and the limits of large language models in crisis analysis and suggest future directions for integrating real-time social media with AI-assisted emergency response tools.
Yijun Gu studied this question.