April 15, 2026 · Open Access

AUDIO DEEPFAKE DETECTION USING HYBRID CNN-BiLSTM WITH FEATURE FUSION OF MFCC AND MEL SPECTROGRAM

Key Points

The study aims to develop an effective framework that detects audio deepfakes by analyzing synthetic speech for inconsistencies.
Fused Mel Frequency Cepstral Coefficients (MFCC) with Mel spectrogram features.
Used a one-dimensional Convolutional Neural Network (CNN) to identify local spectral irregularities.
Used a Bidirectional Long Short-Term Memory (BiLSTM) layer to track how these irregularities evolve over time in both directions.
Added a self-attention layer to highlight the most telling moments in a recording.
Achieved 96.8% accuracy on the ASVspoof 2019 Logical Access benchmark, which pits real speech against nineteen synthetic systems.
Reported precision of 96.2%, recall of 97.1%, and F1-score of 96.6%.
Returns a verdict in under two seconds on ordinary laptop hardware through a Streamlit web interface.
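
The precision, recall, and F1-score listed above can be cross-checked against one another, because F1 is the harmonic mean of precision and recall. The short sketch below does that arithmetic with the reported figures; the function and variable names are illustrative and not taken from the study.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures as reported for the ASVspoof 2019 evaluation.
reported_precision = 0.962
reported_recall = 0.971

f1 = f1_score(reported_precision, reported_recall)
print(f"F1 = {f1:.3f}")  # prints F1 = 0.966
```

The result rounds to 96.6%, which agrees with the reported F1-score.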

Abstract

Voice cloning and speech synthesis tools have become widely accessible in recent years, raising genuine concerns about their misuse in fraud, misinformation, and identity theft. Detecting such fabricated audio is no longer an academic curiosity but a pressing societal need. This work introduces a lightweight yet effective detection framework that listens for subtle inconsistencies in synthetic speech by combining two complementary audio representations, Mel Frequency Cepstral Coefficients (MFCC) and the Mel spectrogram, and feeding their fused form into a hybrid deep learning pipeline. The pipeline first applies a one-dimensional Convolutional Neural Network (CNN) to spot local spectral irregularities, then passes the output through a Bidirectional Long Short-Term Memory (BiLSTM) layer to track how those irregularities evolve over time in both directions, and finally uses a self-attention layer to spotlight the most telling moments in the recording. When tested on the ASVspoof 2019 Logical Access benchmark, which pits real speech against nineteen different synthetic systems, the proposed model records an accuracy of 96.8%, precision of 96.2%, recall of 97.1%, and F1-score of 96.6%. The whole system is wrapped in a Streamlit web interface that returns a verdict in under two seconds on ordinary laptop hardware, showing that strong protection against audio deepfakes does not demand expensive infrastructure.
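
The pipeline order described in the abstract (fused features, then a 1D CNN, then a BiLSTM, then self-attention) could be sketched in PyTorch roughly as follows. This is not the authors' implementation. The abstract does not say how the features are fused or give any layer sizes, so the frame-wise concatenation, the feature counts (40 MFCCs, 64 Mel bands), the kernel size, the hidden sizes, the four attention heads, and the mean pooling before the classifier are all assumptions made for illustration. The class name `CnnBiLstmAttention` is hypothetical.

```python
import torch
import torch.nn as nn


class CnnBiLstmAttention(nn.Module):
    """Illustrative CNN -> BiLSTM -> self-attention classifier (all sizes assumed)."""

    def __init__(self, n_mfcc=40, n_mels=64, conv_channels=64, lstm_hidden=64):
        super().__init__()
        fused_dim = n_mfcc + n_mels  # assumption: fusion by concatenating features
        self.conv = nn.Sequential(
            nn.Conv1d(fused_dim, conv_channels, kernel_size=5, padding=2),
            nn.BatchNorm1d(conv_channels),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=2),  # halves the number of frames
        )
        self.bilstm = nn.LSTM(
            conv_channels, lstm_hidden, batch_first=True, bidirectional=True
        )
        self.attn = nn.MultiheadAttention(
            embed_dim=2 * lstm_hidden, num_heads=4, batch_first=True
        )
        self.classifier = nn.Linear(2 * lstm_hidden, 2)  # bona fide vs. spoof

    def forward(self, mfcc, mel):
        # mfcc: (batch, n_mfcc, frames); mel: (batch, n_mels, frames)
        fused = torch.cat([mfcc, mel], dim=1)         # (batch, fused_dim, frames)
        local = self.conv(fused)                      # (batch, channels, frames // 2)
        seq, _ = self.bilstm(local.transpose(1, 2))   # (batch, frames // 2, 2 * hidden)
        attended, weights = self.attn(seq, seq, seq)  # self-attention over time steps
        pooled = attended.mean(dim=1)                 # assumption: mean pooling
        return self.classifier(pooled), weights


# Usage with random stand-in features: 2 clips, 200 frames each.
torch.manual_seed(0)
model = CnnBiLstmAttention().eval()
mfcc = torch.randn(2, 40, 200)
mel = torch.randn(2, 64, 200)
with torch.no_grad():
    logits, attn = model(mfcc, mel)
```

Here `logits` holds one pair of class scores per clip, and `attn` holds the attention weights between time steps, which is the kind of signal that could be inspected to see which moments of a recording the model weighted most heavily.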

Authors

Subathra R

Thiru Selvam T

Prem Kumar R

Institutions

Government College of Science
