This study is conducted in 19 diverse catchments (43–7907 km²) across the humid mountainous regions of southern China. We develop a rigorous comparative framework to evaluate a data-driven Bidirectional Long Short-Term Memory (BiLSTM) model against the traditional conceptual Xinanjiang model in a single-catchment setting. By systematically varying the training data from 30% to 80% of available flood events, we quantify the data-performance relationship, identify the critical training data threshold at which deep learning becomes competitive, and uncover the physical catchment characteristics that control model suitability. A critical data threshold of 70–80% of available flood events (approximately 21–24 events) is identified, below which the conceptual model is superior and above which the BiLSTM achieves competitive performance. This threshold is fundamentally controlled by catchment scale: small-scale catchments favour the BiLSTM, whereas the conceptual model retains its advantage in large-scale catchments. This pattern reflects how monsoon-driven flood processes manifest differently across the region's physiographic gradient. Furthermore, the Nash-Sutcliffe Efficiency (NSE) and Kling-Gupta Efficiency (KGE) metrics exhibit different convergence patterns with increasing data availability, which has implications for comprehensive model evaluation in data-limited contexts. These findings culminate in an actionable, scale-informed framework for model selection that can guide provincial hydrological bureaus in transitioning from traditional to deep learning approaches.

• BiLSTM needs 70–80% training data (21–24 floods) to match conceptual models.
• NSE and KGE metrics show different convergence thresholds (50% vs 70% MDR).
• Catchment scale dictates model error, while topography governs model efficiency.
• Actionable model selection rules are provided based on catchment scale (PC1 value).
Li et al. studied this question.
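The two efficiency metrics named in the abstract can be illustrated with a minimal sketch. It assumes the standard NSE definition and the original 2009 formulation of KGE (correlation, variability ratio, bias ratio); the study may use a different KGE variant. The function names and the discharge values are hypothetical and are not taken from the study.

```python
import math


def _mean(values):
    return sum(values) / len(values)


def nse(obs, sim):
    """Nash-Sutcliffe Efficiency: 1 minus squared error over observed variance."""
    mean_obs = _mean(obs)
    squared_error = sum((o - s) ** 2 for o, s in zip(obs, sim))
    obs_variance_sum = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - squared_error / obs_variance_sum


def kge(obs, sim):
    """Kling-Gupta Efficiency (2009 form, assumed here): combines
    correlation r, variability ratio alpha, and bias ratio beta."""
    n = len(obs)
    mean_obs, mean_sim = _mean(obs), _mean(sim)
    std_obs = math.sqrt(sum((o - mean_obs) ** 2 for o in obs) / n)
    std_sim = math.sqrt(sum((s - mean_sim) ** 2 for s in sim) / n)
    cov = sum((o - mean_obs) * (s - mean_sim) for o, s in zip(obs, sim)) / n
    r = cov / (std_obs * std_sim)
    alpha = std_sim / std_obs
    beta = mean_sim / mean_obs
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


# Hypothetical flood hydrograph (m^3/s) and a simulation that overestimates by 20%.
observed = [10.0, 50.0, 120.0, 80.0, 30.0]
simulated = [1.2 * q for q in observed]

nse_value = nse(observed, simulated)  # about 0.870
kge_value = kge(observed, simulated)  # about 0.717
print(f"NSE = {nse_value:.3f}, KGE = {kge_value:.3f}")
```

In this toy case a uniform 20% overestimate costs KGE more than NSE, because KGE penalizes the bias and variability ratios separately. Differences of this kind are one reason the two metrics need not respond identically as training data increases.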