Fine-grained bird identification is crucial for ecosystem monitoring, species conservation, and habitat assessment. In real-world environments, however, it is hampered by imbalanced modality quality and background noise. To improve fine-grained audio-visual bird classification under heterogeneous modality conditions, we propose CABIF-Net, an audio-visual feature fusion framework. A confidence-based Top-K mean pooling module selects key frames to refine the video-level visual representation. A Confidence Calibration module dynamically assesses the reliability of the visual and audio modalities, and a Bidirectional Inter-modulation Fusion module provides controllable cross-modal information interaction. We evaluate on two datasets: the publicly available SSW60 dataset, which is characterized by severe noise and imbalanced modality quality, and the self-built Birds21 dataset, which has balanced modality quality. CABIF-Net achieves classification accuracies of 85.76% and 96.67%, respectively, outperforming existing unimodal methods and several mainstream fusion strategies. Weight distribution and visualization analyses further indicate that the method adaptively adjusts modality contributions according to sample-level discriminative evidence. This study provides an effective framework for fine-grained audio-visual bird species recognition.
Li et al. studied this question.
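The abstract names a confidence-based Top-K mean pooling step but does not define it. The following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes that each frame has a feature vector and a vector of class logits, that a frame's confidence is its maximum softmax probability, and that the video-level representation is the mean of the features of the K most confident frames. The function names `softmax` and `topk_mean_pool` and all parameters are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one frame's class logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def topk_mean_pool(
    frame_logits: Sequence[Sequence[float]],
    frame_features: Sequence[Sequence[float]],
    k: int,
) -> Tuple[List[float], List[int]]:
    """Average the features of the k most confident frames.

    Assumption (not from the paper): confidence is the maximum
    softmax probability of a frame's class logits.
    Returns the pooled feature vector and the selected frame indices.
    """
    if len(frame_logits) != len(frame_features):
        raise ValueError("logits and features must cover the same frames")
    if not frame_logits or k < 1:
        raise ValueError("need at least one frame and k >= 1")

    k = min(k, len(frame_logits))
    confidences = [max(softmax(logits)) for logits in frame_logits]

    # Rank frame indices from most to least confident and keep the top k.
    ranked = sorted(
        range(len(confidences)), key=lambda i: confidences[i], reverse=True
    )
    selected = ranked[:k]

    # Mean-pool the selected frames, one feature dimension at a time.
    dim = len(frame_features[0])
    pooled = [
        sum(frame_features[i][d] for i in selected) / k for d in range(dim)
    ]
    return pooled, selected


# Three frames, two classes: frame 0 is uninformative, frames 1 and 2 are confident.
logits = [[0.0, 0.0], [5.0, 0.0], [0.0, 3.0]]
features = [[1.0, 1.0], [2.0, 4.0], [4.0, 8.0]]
pooled, selected = topk_mean_pool(logits, features, k=2)
```

In this toy input the uninformative frame 0 is excluded, so the pooled vector depends only on frames 1 and 2. Other confidence measures, such as negative entropy, would fit the same structure.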