Twenty-eight within-subject counterfactual experiments across 2, 047 tabular datasets, plus a boundary experiment on 129 temporal datasets, measuring the severity of four data leakage classes in machine learning. Class I (estimation — fitting scalers on full data) is negligible: all nine conditions produce |ΔAUC| ≤ 0. 005. Class II (selection — peeking, seed cherry-picking) is substantial: ~90% of the measured effect is noise exploitation that inflates reported scores. Class III (memorization) scales with model capacity: dᵦ = 0. 37 (Naive Bayes) to 1. 11 (Decision Tree). Class IV (boundary) is invisible under random CV. The textbook emphasis is inverted: normalization leakage matters least; selection leakage at practical dataset sizes matters most.
Simon Roth (Fri,) studied this question.