Deep neural networks are widely used for image classification across many fields, yet selecting an appropriate architecture often remains a trial-and-error process. This work investigates a convolutional neural network used to detect whale pulses in spectrograms, with the aim of understanding the causes of its underperformance. By examining the behaviour of its internal layers, we show that the early convolutional blocks capture the most informative acoustic features, while deeper layers provide limited additional benefit and, under the considered training conditions, may even degrade classification accuracy. Based on these observations, we derive a simplified architecture consisting of only the first two convolutional layers followed by a lightweight classifier. This network achieves near-optimal performance, improving accuracy from 87% to 98%, and exhibits substantially lower variability between repetitions than the original model.
Román-Ruiz et al. studied this question.
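The simplification described in the abstract, keeping only the first two convolutional layers and attaching a lightweight classifier, can be illustrated with a small pure-Python sketch that traces feature-map sizes and counts trainable parameters. Everything numeric here is an assumption made for illustration and is not taken from the paper: the 64x64 single-channel input, the channel widths, the 3x3 kernels with 2x2 max pooling, the four-block depth of the original stack, the two output classes, and the choice of a single dense layer as the "lightweight classifier". The names `ConvBlock` and `describe` are likewise hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConvBlock:
    """A 2D convolution followed by max pooling (hypothetical sizes)."""
    in_channels: int
    out_channels: int
    kernel: int = 3
    pool: int = 2

    def output_size(self, height, width):
        # Unpadded, stride-1 convolution, then non-overlapping pooling.
        h = (height - self.kernel + 1) // self.pool
        w = (width - self.kernel + 1) // self.pool
        return h, w

    def num_params(self):
        # One kernel per (input, output) channel pair, plus one bias per output channel.
        return self.in_channels * self.out_channels * self.kernel ** 2 + self.out_channels


def describe(blocks, height, width, num_classes=2):
    """Trace the feature-map size through the blocks and count parameters.

    The classifier is assumed to be a single dense layer on the flattened
    features; the paper only says "lightweight classifier".
    """
    params = 0
    for block in blocks:
        height, width = block.output_size(height, width)
        params += block.num_params()
    features = blocks[-1].out_channels * height * width
    params += features * num_classes + num_classes  # dense weights + biases
    return {"feature_map": (height, width), "features": features, "params": params}


# Hypothetical channel widths and input size, NOT taken from the paper.
full_stack = [ConvBlock(1, 16), ConvBlock(16, 32), ConvBlock(32, 64), ConvBlock(64, 128)]
truncated = full_stack[:2]  # keep only the first two convolutional blocks

summary_full = describe(full_stack, 64, 64)
summary_truncated = describe(truncated, 64, 64)
```

Under these assumed sizes the truncated stack has fewer trainable parameters than the full one. The sketch says nothing about accuracy, which depends on training and data as reported in the abstract.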