What question did this study set out to answer?

This research aims to evaluate the severity of different types of data leakage in machine learning models.

April 5, 2026 · Open Access

Which Leakage Types Matter?

Key Points

Conducted 28 within-subject counterfactual experiments
Analyzed 2,047 tabular datasets and 129 temporal datasets
Measured severity of four classes of data leakage
Class I leakage (estimation) has negligible effects with |ΔAUC| ≤ 0.005
Class II leakage (selection) substantially inflates reported scores; ~90% of the measured effect is noise exploitation
Class III leakage (memorization) scales with model capacity, with effect sizes from 0.37 (Naive Bayes) to 1.11 (Decision Tree)
Class IV leakage (boundary) is undetectable under random cross-validation

Abstract

Twenty-eight within-subject counterfactual experiments across 2,047 tabular datasets, plus a boundary experiment on 129 temporal datasets, measure the severity of four data leakage classes in machine learning. Class I (estimation: fitting scalers on full data) is negligible: all nine conditions produce |ΔAUC| ≤ 0.005. Class II (selection: peeking, seed cherry-picking) is substantial: ~90% of the measured effect is noise exploitation that inflates reported scores. Class III (memorization) scales with model capacity: dᵦ = 0.37 (Naive Bayes) to 1.11 (Decision Tree). Class IV (boundary) is invisible under random CV. The textbook emphasis is inverted: normalization leakage matters least; selection leakage at practical dataset sizes matters most.
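
To make the "noise exploitation" behind seed cherry-picking concrete, here is a minimal sketch using only the Python standard library. It is not the study's protocol: the function names, the sample sizes, and the use of pure-noise scores as a stand-in for "retraining with a new seed" are all assumptions chosen for illustration. Because every run has a true AUC of 0.5, any gap between the best seed and the average seed comes from selection alone.

```python
import random
import statistics


def auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Count pairwise wins; ties count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def seed_cherry_picking(n_test=100, n_seeds=50, data_seed=0):
    """Return (best AUC, mean AUC) over many seeds for a signal-free model."""
    rng = random.Random(data_seed)
    labels = [i % 2 for i in range(n_test)]  # balanced binary labels
    rng.shuffle(labels)

    aucs = []
    for seed in range(1, n_seeds + 1):
        # Hypothetical stand-in for "retrain with a new seed": the scores
        # are pure noise, so the true AUC of every run is 0.5.
        model_rng = random.Random(seed)
        scores = [model_rng.random() for _ in range(n_test)]
        aucs.append(auc(labels, scores))

    return max(aucs), statistics.mean(aucs)


if __name__ == "__main__":
    best, mean = seed_cherry_picking()
    print(f"cherry-picked AUC:        {best:.3f}")
    print(f"mean AUC over seeds:      {mean:.3f}")
    print(f"inflation from selection: {best - mean:.3f}")
```

Reporting the mean across seeds, or choosing the seed on a validation split that is never used for the final score, removes this gap in the sketch.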

Cite This Study

Simon Roth studied this question.

synapsesocial.com/papers/69d1fcfda79560c99a0a2c95 https://doi.org/10.5281/zenodo.19406148
