What question did this study set out to answer?

The aim is to develop a robust and interpretable deepfake detection method by integrating both spatial and frequency features.

April 10, 2026 | Open Access

A hybrid spatial–frequency attention-based algorithm using EfficientNet for robust and interpretable deepfake detection

Key Points

Utilized EfficientNet-B7 as the backbone for feature extraction.
Combined RGB visual representations with discrete cosine transform (DCT) frequency components.
Applied a convolutional block attention module (CBAM) to highlight manipulation-relevant spatial and channel-wise information.
Conducted experiments on the FaceForensics++ C23 dataset to evaluate performance.
Achieved a ROC-AUC score of 0.997, indicating state-of-the-art performance.
Demonstrated high precision-recall balance and efficient training convergence.
Showed improved generalization and interpretability.
Utilized CAM-based visualizations to identify areas on faces prone to manipulation.
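
To make the ROC-AUC figure in the key points concrete: ROC-AUC can be read as the probability that a randomly chosen fake sample receives a higher score than a randomly chosen real one. The sketch below computes it by direct pairwise comparison. The function name `roc_auc` and the toy labels and scores are hypothetical illustrations, not data or code from the paper.

```python
def roc_auc(labels, scores):
    """Rank-based ROC-AUC: P(score of a fake > score of a real); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy scores (NOT results from the paper): 1 = fake, 0 = real.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.95, 0.80, 0.40, 0.60, 0.20, 0.10]
auc = roc_auc(labels, scores)  # 8 of the 9 fake/real pairs are ordered correctly
```

A value of 0.997 therefore means that almost every fake/real pair in the test set is ranked correctly. The pairwise loop is quadratic in the number of samples, so a sort-based implementation would be preferable for large evaluation sets.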

Abstract

Rapid progress in generative media synthesis has made deepfake content both more convincing and more accessible, posing serious risks to digital trust, media authenticity, and forensic security. Existing deepfake detection methods usually rely on either spatial-domain visual cues or frequency-domain artifacts alone, which limits their robustness, their generalization under realistic compression, and their interpretability. To overcome these limitations, this paper introduces a generalizable hybrid spatial-frequency deepfake detector that combines RGB-based visual representations with discrete cosine transform (DCT) frequency components in a high-capacity convolutional network with attention-based refinement. The proposed framework uses an EfficientNet-B7 backbone to extract rich hierarchical features and a convolutional block attention module (CBAM) to adaptively emphasize manipulation-relevant information along both the spatial and channel dimensions. Fusing spatial and frequency information early allows the model to jointly exploit semantic inconsistencies and the fine-scale high-frequency distortions introduced when synthetic content is generated. Comprehensive experiments on the FaceForensics++ C23 dataset show that the proposed method achieves state-of-the-art performance, with a ROC-AUC of 0.997, a high precision-recall balance, and efficient training convergence. Feature-space analysis and prediction probability distributions further support class separability, and more advanced CAM-based visualizations provide meaningful forensic explanations by identifying manipulation-prone facial regions.
The high detection accuracy, improved generalization potential, and greater interpretability underline the effectiveness of the proposed hybrid framework and confirm its suitability for real-world deepfake forensics.
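
The early fusion of RGB and DCT information described in the abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that the frequency input is the log-magnitude of a 2-D DCT of the luma image, stacked as a fourth channel next to R, G, and B. The abstract does not specify the exact fusion layout, so the helper names (`dct_1d`, `dct_2d`, `fuse_rgb_dct`), the luma weights, and the four-channel format are assumptions rather than the authors' implementation.

```python
import math


def dct_1d(x):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def dct_2d(block):
    """Separable 2-D DCT-II: transform every row, then every column."""
    rows = [dct_1d(r) for r in block]
    cols = [dct_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]  # transpose back to [row][column]


def fuse_rgb_dct(rgb):
    """Append a log-magnitude DCT value to each RGB pixel (assumed fusion layout)."""
    # Luma approximation with BT.601 weights (an assumption, not from the paper).
    gray = [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row] for row in rgb]
    coeffs = dct_2d(gray)
    # log1p compresses the large dynamic range of DCT coefficients.
    freq = [[math.log1p(abs(c)) for c in row] for row in coeffs]
    return [[(*px, f) for px, f in zip(prow, frow)] for prow, frow in zip(rgb, freq)]


# Hypothetical 4x4 constant gray image with values in [0, 1].
image = [[(0.5, 0.5, 0.5)] * 4 for _ in range(4)]
fused = fuse_rgb_dct(image)  # 4x4 grid of (R, G, B, DCT) tuples
```

For this constant image all of the energy falls into the DC coefficient at position [0][0]. Manipulation artifacts would instead show up as unexpected energy at higher frequency indices. Note that the fourth channel is indexed by frequency rather than pixel position, so a real pipeline might prefer block-wise DCTs or a separate input branch. In practice a library routine such as a fast DCT would replace the quadratic loops used here.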

Cite this study

Kumar et al. studied this question.

www.synapsesocial.com/papers/69d8967d6c1944d70ce07f5b — DOI: https://doi.org/10.1038/s41598-026-46086-9

Authors

Mohit Kumar

Ashwani Kumar

Vikram Yadav

Journals

Scientific Reports

Institutions

Amazon (United States)

Symbiosis International University

Central University of Jharkhand
