River flooding and urban inundation require forecasting systems that can anticipate future risk, rather than systems that only estimate the current water state. However, real-world closed-circuit television (CCTV)-based flood datasets often contain imbalanced or temporally inconsistent risk labels. In addition, most image-based approaches remain limited to static scene understanding. This study proposes a dataset reformulation and temporal multi-task forecasting framework for CCTV-based flood-risk prediction. First, we introduce a site-relative relabeling strategy that converts noisy frame-level danger annotations into four risk levels using visual flood indicators and lightweight environmental cues. Second, we transform the original frame-based dataset into site-hour sequences for multi-horizon forecasting at 1 h, 3 h, and 6 h. Third, we evaluate image-only, weather-only, and naive multimodal configurations to examine the role and limitations of heterogeneous sensor fusion. On the reformulated dataset, the image-only temporal model achieved the best overall performance, with a mean Intersection over Union (mIoU) of 0.892, Dice score of 0.940, macro-averaged F1 score (Macro-F1) of 0.532, and high-risk recall of 0.642. In contrast, naive multimodal fusion reduced Macro-F1 to 0.267 and high-risk recall to 0.070. This result indicates that additional weather inputs do not automatically improve prediction when cross-modal signals are noisy, weakly correlated, or temporally misaligned. The ablation results further showed that removing temporal modeling decreased Macro-F1 to 0.227 and high-risk recall to 0.000. These findings demonstrate that dataset reformulation and temporal modeling are essential for extending CCTV-based flood analysis from static estimation to future risk forecasting. They also suggest that robust cross-modal alignment is required before multimodal sensing can provide reliable performance gains.
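The site-hour reformulation described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the record layout `(site, hour, risk)`, the helper names `to_site_hours` and `make_samples`, the max-over-frames hourly aggregation, and the three-hour input window are all assumptions made for illustration. Only the four risk levels (coded here as 0 to 3) and the 1 h, 3 h, and 6 h horizons come from the text. Hourly risk levels stand in for the image features a real model would consume.

```python
from collections import defaultdict

HORIZONS = (1, 3, 6)  # forecast horizons in hours, as stated in the abstract


def to_site_hours(frames):
    """Collapse frame-level records (site, hour, risk) into one risk per site-hour.

    Assumption: the hourly label is the maximum frame-level risk (0-3)
    observed within that hour; the paper may aggregate differently.
    """
    hourly = defaultdict(dict)
    for site, hour, risk in frames:
        hourly[site][hour] = max(risk, hourly[site].get(hour, 0))
    return hourly


def make_samples(hourly, window=3, horizons=HORIZONS):
    """Build (input window, multi-horizon targets) pairs per site.

    A sample is kept only if every hour in the input window and every
    target hour is present, so gaps in the record produce no sample.
    """
    samples = []
    for site, series in hourly.items():
        for end in sorted(series):
            past = [end - k for k in range(window - 1, -1, -1)]
            future = [end + h for h in horizons]
            if all(h in series for h in past + future):
                samples.append({
                    "site": site,
                    "end_hour": end,
                    # Placeholder inputs: hourly risk levels instead of image features.
                    "inputs": [series[h] for h in past],
                    # One target label per forecast horizon.
                    "targets": {h: series[end + h] for h in horizons},
                })
    return samples
```

Dropping windows with missing hours keeps every sample temporally consistent at the cost of fewer samples; imputing the gaps would be an alternative design, with its own risk of introducing label noise.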
Lee et al. studied this question.