What question did this study set out to answer?

The aim is to develop a robust framework for music genre recognition that addresses limitations in existing techniques.

March 28, 2026 | Open Access

HDAWN-HRDAE: A hybrid framework for the music genre classification with the dual temporal and frequency feature modeling

Key Points

The aim is to develop a robust framework for music genre recognition that addresses limitations in existing techniques.
Developed a Hybrid Deep Autoencoders and Wavelet Networks (HDAWN) framework.
Utilized Hierarchical Recurrent Deep Autoencoders (HRDAE) to capture temporal patterns from spectrograms.
Employed symmetric and asymmetric wavelet groups to optimize feature extraction for various conditions.
Conducted evaluations using the GTZAN dataset for performance comparison.
Achieved 94.55% classification accuracy, outperforming conventional DNN-SVM baselines by 13.88 percentage points.
Demonstrated a 1-4% improvement over several recent deep learning approaches.
Maintained low computational complexity of 0.03 GMAC, significantly less than deeper architectures like CRNN-9 and ResNet18.

Abstract

Music genre recognition is crucial for organization, content recommendation, and retrieval applications. Existing techniques often struggle to capture multi-dimensional and dynamic features under varying ambient conditions. A novel Hybrid Deep Autoencoders and Wavelet Networks (HDAWN) framework has been developed to address these limitations and make music genre recognition more robust. Autoencoders automatically extract optimal feature combinations, whereas wavelet networks identify periodic frequency characteristics that adapt to diverse scenarios. Hierarchical Recurrent Deep Autoencoders (HRDAE) additionally capture temporal patterns in the input spectrogram, enhancing the system’s temporal feature representation. Symmetric and asymmetric wavelet groups are compared comprehensively on the George Tzanetakis (GTZAN) dataset to optimize wavelet selection. Evaluated on the GTZAN dataset, the proposed approach achieves 94.55% classification accuracy, which is 13.88 percentage points higher than conventional DNN-SVM baselines (80.67%) and 1 to 4% higher than several recent deep learning approaches. In addition, the framework has a low computational complexity of only 0.03 GMAC, roughly 6 to 60 times less than deeper CNN-based architectures such as CRNN-9 (0.27 GMAC) and ResNet18 + 3D (1.79 GMAC). These results show that the proposed hybrid architecture improves classification accuracy and computational efficiency at the same time, offering a good trade-off between performance and resource consumption.
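The comparisons quoted in the abstract can be re-derived with a few lines of arithmetic. The sketch below is only a sanity check on the reported numbers, not part of the study; the variable names are illustrative, and because the GMAC values are taken as rounded in the abstract, the resulting ratios are approximate.

```python
# Figures as quoted in the abstract (reported values, not re-measured here).
proposed_acc = 94.55   # proposed framework, GTZAN accuracy in percent
baseline_acc = 80.67   # conventional DNN-SVM baseline, accuracy in percent

proposed_gmac = 0.03   # reported cost of the proposed framework
other_gmac = {"CRNN-9": 0.27, "ResNet18 + 3D": 1.79}

# Absolute gap in percentage points, and the same gap relative to the baseline.
gap_points = round(proposed_acc - baseline_acc, 2)
relative_gain = round(100 * gap_points / baseline_acc, 1)

# How many times more multiply-accumulate operations each deeper model needs.
cost_ratio = {name: round(g / proposed_gmac, 1) for name, g in other_gmac.items()}

print(f"gap: {gap_points} points ({relative_gain}% relative to the baseline)")
for name, ratio in cost_ratio.items():
    print(f"{name}: about {ratio}x the GMAC of the proposed framework")
```

With the rounded values above, the gap comes to 13.88 points and the cost ratios to about 9 and about 60; the abstract's lower bound of roughly 6 may rest on unrounded figures that are not given here.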
