June 4, 2026 | Open Access

Vision-Based Environmental Sensing for Flood Risk Forecasting: Dataset Relabeling and Temporal Multi-Task Learning

Key Points

The study aims to improve flood risk prediction using CCTV data by reformulating the dataset and employing temporal multi-task learning.
Proposes a site-relative relabeling strategy that converts noisy frame-level danger annotations into four risk levels.
Transforms the frame-based dataset into site-hour sequences for multi-horizon forecasting at 1 h, 3 h, and 6 h.
Evaluates image-only, weather-only, and naive multimodal configurations to examine the role and limits of sensor fusion.
The image-only temporal model performed best, with a mean Intersection over Union (mIoU) of 0.892, a Dice score of 0.940, a Macro-F1 of 0.532, and a high-risk recall of 0.642.
Naive multimodal fusion reduced Macro-F1 to 0.267 and high-risk recall to 0.070, indicating that extra weather inputs do not help when cross-modal signals are noisy, weakly correlated, or temporally misaligned.
Removing temporal modeling decreased Macro-F1 to 0.227 and high-risk recall to 0.000.
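
For readers less familiar with the reported metrics, the following is a minimal Python sketch of how IoU, Dice, Macro-F1, and per-class recall are commonly computed. It is not the authors' evaluation code. The function names, the binary water/background mask encoding, and the use of class index 3 as the "high-risk" level are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def iou_and_dice(pred: Sequence[int], target: Sequence[int]) -> Tuple[float, float]:
    """IoU and Dice for two flattened binary masks (1 = water, 0 = background)."""
    intersection = sum(p & t for p, t in zip(pred, target))
    pred_area, target_area = sum(pred), sum(target)
    union = pred_area + target_area - intersection
    if union == 0:
        # Both masks are empty: score as a perfect match (a common convention).
        return 1.0, 1.0
    return intersection / union, 2 * intersection / (pred_area + target_area)


def per_class_f1_and_recall(
    y_true: Sequence[int], y_pred: Sequence[int], num_classes: int = 4
) -> Tuple[List[float], List[float]]:
    """One-vs-rest F1 and recall for each risk level (assumed to be 0..3)."""
    f1s, recalls = [], []
    for c in range(num_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1s.append(f1)
        recalls.append(recall)
    return f1s, recalls


def macro_f1(f1s: Sequence[float]) -> float:
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    return sum(f1s) / len(f1s)


# Toy example: a classifier that never predicts the highest level (index 3).
y_true = [0, 0, 1, 2, 3, 3]
y_pred = [0, 0, 1, 2, 2, 2]
f1s, recalls = per_class_f1_and_recall(y_true, y_pred)
HIGH_RISK = 3  # assumption: the highest of the four levels is "high risk"
print(macro_f1(f1s), recalls[HIGH_RISK])
```

Because Macro-F1 averages classes without weighting, a model that misses every high-risk sample is penalized heavily even when the other levels are predicted well. A high-risk recall of 0.000 means that no truly high-risk sample was assigned the high-risk level.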

Abstract

River flooding and urban inundation require forecasting systems that can anticipate future risk, rather than systems that only estimate the current water state. However, real-world closed-circuit television (CCTV)-based flood datasets often contain imbalanced or temporally inconsistent risk labels. In addition, most image-based approaches remain limited to static scene understanding. This study proposes a dataset reformulation and temporal multi-task forecasting framework for CCTV-based flood-risk prediction. First, we introduce a site-relative relabeling strategy that converts noisy frame-level danger annotations into four risk levels using visual flood indicators and lightweight environmental cues. Second, we transform the original frame-based dataset into site-hour sequences for multi-horizon forecasting at 1 h, 3 h, and 6 h. Third, we evaluate image-only, weather-only, and naive multimodal configurations to examine the role and limitations of heterogeneous sensor fusion. On the reformulated dataset, the image-only temporal model achieved the best overall performance, with a mean Intersection over Union (mIoU) of 0.892, Dice score of 0.940, macro-averaged F1 score (Macro-F1) of 0.532, and high-risk recall of 0.642. In contrast, naive multimodal fusion reduced Macro-F1 to 0.267 and high-risk recall to 0.070. This result indicates that additional weather inputs do not automatically improve prediction when cross-modal signals are noisy, weakly correlated, or temporally misaligned. The ablation results further showed that removing temporal modeling decreased Macro-F1 to 0.227 and high-risk recall to 0.000. These findings demonstrate that dataset reformulation and temporal modeling are essential for extending CCTV-based flood analysis from static estimation to future risk forecasting. They also suggest that robust cross-modal alignment is required before multimodal sensing can provide reliable performance gains.
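
The site-hour reformulation described in the abstract can be pictured with a small sketch. The abstract does not specify the data layout, the hourly aggregation rule, or the input window length, so the `Frame` tuple, the "maximum risk within the hour" rule, and the `window_h` parameter below are hypothetical choices for illustration only. Only the 1 h, 3 h, and 6 h horizons come from the source.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

HORIZONS_H = (1, 3, 6)  # forecast horizons named in the abstract

# Hypothetical frame record: (site_id, timestamp, risk level 0-3).
Frame = Tuple[str, datetime, int]


def to_site_hours(frames: List[Frame]) -> Dict[str, Dict[datetime, int]]:
    """Collapse frames into one risk label per (site, hour).

    Assumption: the hourly label is the maximum frame-level risk in that hour.
    """
    site_hours: Dict[str, Dict[datetime, int]] = defaultdict(dict)
    for site, ts, risk in frames:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        site_hours[site][hour] = max(risk, site_hours[site].get(hour, 0))
    return site_hours


def build_samples(site_hours: Dict[str, Dict[datetime, int]], window_h: int = 4) -> List[dict]:
    """Pair each window of consecutive hours with targets at every horizon.

    An anchor hour is kept only if all window hours and all target hours exist
    for that site, so gaps in camera coverage never leak into a sample.
    """
    samples = []
    for site, hours in site_hours.items():
        for anchor in sorted(hours):
            window = [anchor - timedelta(hours=k) for k in range(window_h - 1, -1, -1)]
            targets = {h: anchor + timedelta(hours=h) for h in HORIZONS_H}
            if all(w in hours for w in window) and all(t in hours for t in targets.values()):
                samples.append({
                    "site": site,
                    "anchor": anchor,
                    "input_hours": window,  # keys for looking up that hour's frames
                    "targets": {h: hours[t] for h, t in targets.items()},
                })
    return samples


# Toy data: one site, 12 consecutive hours, risk cycling through 0..3.
start = datetime(2024, 7, 1, 0, 0)
frames = [("site_A", start + timedelta(hours=h, minutes=10), h % 4) for h in range(12)]
samples = build_samples(to_site_hours(frames))
print(len(samples), samples[0]["targets"])
```

Grouping by site before windowing keeps each sequence within a single camera, which is one plausible reading of "site-hour sequences". A multi-task model would then predict the three horizon targets jointly from the same input window.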
