Reproducibility lies at the foundation of the empirical method: a novel approach will be widely adopted if its experimental results can be validated and reproduced by the community. Previous work on reproducibility in Information Retrieval (IR) has mainly addressed the reproducibility and replicability of offline experiments, with a few exceptions that replicate user studies. To the best of our knowledge, no previous work has investigated how reproducibility affects real users. In this paper, we address this gap by evaluating and comparing the reproducibility of an IR system both offline and online. We consider a reference system and generate a constellation of reproduced systems with varying parameters. We select \(6\) systems with different degrees of offline reproducibility. We then run a between-subjects online experiment with \(280\) participants and collect clicks to evaluate online reproducibility. Results show that real users do not perceive moderate variations in the degree of reproducibility of systems, whereas these variations become relevant as the difference from the original system increases. Furthermore, we train a click model to evaluate online reproducibility with simulated clicks. The results obtained with simulated clicks are not consistent with those from the user study, suggesting that better click models are needed to evaluate online reproducibility. Our data and source code are publicly available: https://github.com/angelogeninatti/reproducibilityLogs
Cossatin et al. (Fri,) studied this question.