Open environmental monitoring datasets are increasingly used in water-pollution research because they provide broad spatial and temporal coverage and support reproducible large-scale analyses. However, their interpretation may depend strongly on preprocessing decisions, particularly when many observations are reported below the limit of quantification (LOQ). This study evaluated the sensitivity of inferred heavy-metal pollution patterns to preprocessing choices in open European surface-water monitoring data. Publicly available Waterbase records for cadmium, lead, and nickel were restricted to rivers and lakes. After removing missing values and a subset of implausible extreme observations above 1000 µg/L, the main analytical dataset contained 1,475,409 observations. Below-LOQ records accounted for 66.6% of cadmium, 57.3% of lead, and 36.1% of nickel observations. A separate censoring-analysis dataset (1,259,636 observations) was used to compare three scenarios: removal of below-LOQ observations, substitution with half the LOQ, and substitution with the full LOQ. Censoring treatment substantially affected concentration summaries, with the strongest sensitivity observed for cadmium, followed by lead, whereas nickel was comparatively more stable. The effect persisted after station-year aggregation and also altered hotspot identification. These findings show that although open monitoring data are valuable for pollution research, robust interpretation requires explicit and transparent reporting of preprocessing decisions.
Seweryn Lipiński studied this question.
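The three censoring scenarios compared in the abstract (removal of below-LOQ observations, substitution with half the LOQ, and substitution with the full LOQ) can be illustrated with a minimal sketch. This is not the study's code: the record layout, the function names `apply_scenario` and `summarize`, and the sample values are all hypothetical and chosen only to show how the choice of treatment shifts the mean and median.

```python
from statistics import mean, median

# Hypothetical records: (reported value in µg/L, LOQ in µg/L, below-LOQ flag).
# These numbers are invented for illustration and are NOT Waterbase data.
RECORDS = [
    (0.05, 0.10, True),
    (0.10, 0.10, True),
    (0.30, 0.10, False),
    (0.80, 0.10, False),
    (0.02, 0.05, True),
]

SCENARIOS = ("remove", "half", "full")


def apply_scenario(records, scenario):
    """Return the concentrations that remain under one censoring treatment.

    "remove" drops below-LOQ records, "half" replaces them with LOQ / 2,
    and "full" replaces them with the LOQ itself. Quantified records
    (flag False) are kept unchanged in every scenario.
    """
    if scenario not in SCENARIOS:
        raise ValueError(f"unknown scenario: {scenario!r}")
    values = []
    for value, loq, below_loq in records:
        if not below_loq:
            values.append(value)
        elif scenario == "half":
            values.append(loq / 2)
        elif scenario == "full":
            values.append(loq)
        # scenario == "remove": the censored record is skipped
    return values


def summarize(records):
    """Compute count, mean, and median for each censoring scenario."""
    summary = {}
    for scenario in SCENARIOS:
        values = apply_scenario(records, scenario)
        summary[scenario] = {
            "n": len(values),
            "mean": mean(values),
            "median": median(values),
        }
    return summary


if __name__ == "__main__":
    for scenario, stats in summarize(RECORDS).items():
        print(scenario, stats)
```

In this toy sample, three of the five records are below the LOQ, so removing them leaves only the two quantified values and gives a much higher median than either substitution scenario. The same mechanism would be expected to matter most for the metal with the largest below-LOQ share, which the abstract reports as cadmium.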