Intrusion Detection Systems (IDS) are considered critical security tools in ensuring network infrastructure security. However, recent studies on machine learning-based IDS systems are often constrained by their heavy dependence on a single dataset, lack of reproducibility, and lack of transparency in evaluating their performance. In addressing these challenges, a unified and transparent framework for evaluating IDS systems is proposed, which focuses on integrating feature harmonization, multi-model benchmarking, and statistical validation. In achieving this objective, a preprocessing pipeline is designed to harmonize features of both legacy and contemporary network intrusion datasets, i.e., NSL-KDD and CICIDS2017, respectively. This framework will assess various learning models, including supervised, unsupervised, deep learning, and ensemble-based models, through cross-validation and statistical tests such as Wilcoxon signed-rank, McNemar’s, and DeLong tests. Experimental results demonstrate that the Random Forest model performs best in terms of performance metrics, i.e., 98.0% accuracy and 97.0% F1-score on the harmonized data set. Moreover, feature harmonization is found to be the most important factor in improving performance using ablation analysis. Besides, a novel approach of using a cryptographic logging mechanism using SHA-256 hash chaining is proposed for tamper-evident traceability and reproducibility of results in experiments, though it is not as effective as using a blockchain-based approach. Although effective in its application, it is based on manual feature alignment and hence might not be effective in highly heterogeneous data sets.This work provides a unified, reproducible, and statistically grounded framework for evaluating IDS systems, focusing on generalization and transparency in cybersecurity research.
Mishra et al. (Mon,) studied this question.