What question did this study set out to answer?

The aim is to improve the accuracy of network intrusion detection despite the presence of label noise in datasets.

April 10, 2026 | Open Access

Mitigating label noise in network intrusion detection via graph-based sample selection and purification

Key Points

Aimed to improve network intrusion detection accuracy on datasets with noisy labels.
Proposed a data-centric relabeling framework with Normal Sample Discovery and Malicious Sample Screening.
Employed graph propagation for confident-sample selection and label propagation.
Constructed a K-NN graph to spread labels from confident samples to the wider dataset.
Used dual networks to screen samples that remain uncertain after propagation.
Achieved F1 scores of 0.81 on CIC-IDS2017 and 0.98 on DoHBrw-2020 under 40% label noise.
Demonstrated relative improvements of 17.39% and 11.36% over state-of-the-art methods.
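
To put the relative improvements in context, the short sketch below back-computes the baseline F1 scores they imply. It assumes the conventional definition of relative improvement, (new - baseline) / baseline, which the summary does not state explicitly. The function name `implied_baseline` is hypothetical, and the resulting values are estimates derived from the reported figures, not numbers quoted from the paper.

```python
def implied_baseline(f1_new: float, rel_improvement_pct: float) -> float:
    """Baseline F1 implied by a relative improvement.

    Assumes rel_improvement = (f1_new - f1_base) / f1_base, so
    f1_base = f1_new / (1 + rel_improvement).
    """
    return f1_new / (1.0 + rel_improvement_pct / 100.0)


# Reported F1 scores and relative improvements under 40% label noise.
reported = [("CIC-IDS2017", 0.81, 17.39), ("DoHBrw-2020", 0.98, 11.36)]

for dataset, f1, rel in reported:
    base = implied_baseline(f1, rel)
    print(f"{dataset}: implied baseline F1 is about {base:.2f}")
```

Under that assumption, the strongest baselines would sit near 0.69 and 0.88 on the two datasets.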

Abstract

Machine learning has achieved notable progress in malicious traffic detection, yet its effectiveness depends heavily on data that are sufficiently large and reliably labeled. In practice, many datasets are produced by automated labeling pipelines, which inevitably introduce label noise and, in turn, undermine detection performance. Consequently, maintaining robust and generalizable detection under label noise has become a central challenge in network intrusion detection. Existing approaches often emphasize intrinsic model robustness. However, noise can reshape the distribution of hard examples and bias the optimization objective, which may yield unstable decision boundaries and further degrade performance. In this paper, we propose a data-centric relabeling framework comprising two components: Normal Sample Discovery (NSD) via graph propagation and Malicious Sample Screening (MSS) with dual networks. NSD proceeds in three steps: (1) confident-sample selection; (2) K-NN graph construction; and (3) label propagation. We first select high-confidence samples and assume their labels are correct, then build a graph over all samples and propagate labels from the confident subset to the full graph; samples that remain uncertain after propagation are forwarded to MSS for second-stage annotation. NSD aims to recover the majority of correctly labeled instances; these instances act as reliable anchors that guide MSS in labeling the remaining uncertain samples, thereby reducing label noise and stabilizing training. We evaluate the framework on CIC-IDS2017 and DoHBrw-2020. Under 40% label noise, it attains F1 scores of 0.81 and 0.98, respectively, yielding 17.39% and 11.36% relative improvements over state-of-the-art baselines.
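
The graph-propagation stage described above can be illustrated with a minimal sketch. This is not the authors' implementation: it substitutes a standard label-spreading iteration over a symmetric K-NN graph, takes the confident seed samples as given (the abstract does not say how they are selected), and uses hypothetical names and parameters (`knn_graph`, `propagate_labels`, `alpha`, `threshold`).

```python
import numpy as np


def knn_graph(X: np.ndarray, k: int) -> np.ndarray:
    """Symmetric binary k-NN adjacency matrix using Euclidean distance."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)            # a sample is not its own neighbour
    nearest = np.argsort(dist, axis=1)[:, :k]  # indices of the k closest samples
    W = np.zeros(dist.shape)
    rows = np.repeat(np.arange(len(X)), k)
    W[rows, nearest.ravel()] = 1.0
    return np.maximum(W, W.T)                  # keep an edge if either side chose it


def propagate_labels(X, seed_idx, seed_labels, n_classes=2, k=5,
                     alpha=0.9, n_iter=50, threshold=0.8):
    """Spread labels from trusted seeds over the k-NN graph.

    Returns (labels, confident). Samples with confident == False are the
    ones a second-stage screener would have to handle.
    """
    W = knn_graph(X, k)
    deg = W.sum(axis=1)
    S = W / np.sqrt(np.outer(deg, deg))        # D^{-1/2} W D^{-1/2}

    Y0 = np.zeros((len(X), n_classes))
    Y0[seed_idx, seed_labels] = 1.0            # one-hot rows for the seeds only

    F = Y0.copy()
    for _ in range(n_iter):
        # Mix neighbour scores with the original seed labels.
        F = alpha * (S @ F) + (1.0 - alpha) * Y0

    # Normalise rows to class probabilities; unreachable samples stay uniform.
    row_sum = F.sum(axis=1, keepdims=True)
    probs = np.divide(F, row_sum,
                      out=np.full_like(F, 1.0 / n_classes),
                      where=row_sum > 0)
    labels = probs.argmax(axis=1)
    confident = probs.max(axis=1) >= threshold
    return labels, confident
```

In this sketch, samples whose top class probability stays below `threshold` correspond to the uncertain samples that the abstract says are forwarded to MSS. The dual-network screening itself is not shown.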

Authors

Ruifen Zhao

Jiangtao Ding

Qinhao Dong

Journals

Scientific Reports

Institutions

Zhejiang University of Technology

Zhejiang University of Science and Technology

Zhejiang Lab
