What question did this study set out to answer?

This research aims to examine the impact of non-determinism on deep learning performance distributions across various architectures and tasks.

May 6, 2026 · Open Access

Measuring deep learning performance - an empirical study of performance distributions across architectures and tasks

Key points

Conducted 186 experiments on different deep learning architectures for image classification and time series forecasting.
Executed each experiment 100 times with varying random seeds to create performance distributions.
Quantified robustness using metrics for spread, symmetry, and tail risk.
Performance distributions are often non-Gaussian, especially in time series forecasting.
Time series models exhibit significantly higher tail risk, with nearly three times more underperforming outliers compared to image classification models.
Mean performance does not reliably predict robustness, indicating the need for distributional analysis for model selection.

Abstract

Non-determinism in deep learning algorithm design and implementation leads to performance variation, meaning model performance is not a single value but a distribution. These model performance distributions are underexplored despite their impact on robustness. We investigate the robustness of deep learning performance to sources of non-determinism, focusing on how performance distributions differ across architectures and tasks. We conducted 186 experiments on state-of-the-art image classification (ResNet, ViT) and time series forecasting (Autoformer, iTransformer, NLinear, TSMixer) architectures. Each experiment was run 100 times with different random seeds to generate performance distributions, resulting in 18,600 runs. Robustness was quantified using metrics for spread, symmetry, and tail risk. Performance distributions are frequently non-Gaussian, particularly in time series forecasting. Model size does not systematically affect robustness: larger image classification models show fewer outliers but not lower spread, while smaller time series models show lower spread but more extreme underperformers. Training duration does not scale linearly; early stopping effectively balances performance and robustness. Mean performance does not predict robustness: time series forecasting shows moderate correlation while image classification shows none. Time series models produce nearly three times more underperforming outliers than image classification models, indicating substantially higher tail risk. Tail risk poses serious concerns for Trustworthy AI in high-stakes applications. Models that perform well on average may exhibit long tails and extreme outliers that are revealed only through distributional analysis. Mean performance alone should not guide model selection; evaluating spread, symmetry, and tail risk is essential wherever consistent performance is critical.
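To make the idea of a performance distribution concrete, the sketch below summarizes the scores from repeated runs of one experiment (one score per random seed) in terms of spread, symmetry, and tail risk. The specific statistics used here (standard deviation and interquartile range for spread, the third standardized moment for symmetry, and a count of runs beyond a 1.5 × IQR fence for tail risk) are common choices assumed for illustration; the abstract does not state which metrics the authors used. The function name `summarize_runs` and its parameters are hypothetical.

```python
import random
import statistics


def summarize_runs(scores, higher_is_better=True):
    """Summarize per-seed scores of one experiment (illustrative metrics only)."""
    n = len(scores)
    mean = statistics.fmean(scores)

    # Spread: sample standard deviation and interquartile range.
    std = statistics.stdev(scores)
    q1, _median, q3 = statistics.quantiles(scores, n=4)
    iqr = q3 - q1

    # Symmetry: third standardized moment (population skewness).
    pstd = statistics.pstdev(scores)
    if pstd > 0:
        skewness = sum((x - mean) ** 3 for x in scores) / (n * pstd ** 3)
    else:
        skewness = 0.0

    # Tail risk: runs on the "bad" side of a 1.5 * IQR fence
    # (an assumed outlier rule, not necessarily the paper's).
    if higher_is_better:  # e.g. accuracy
        fence = q1 - 1.5 * iqr
        underperformers = [x for x in scores if x < fence]
    else:  # e.g. forecasting error such as MSE
        fence = q3 + 1.5 * iqr
        underperformers = [x for x in scores if x > fence]

    return {
        "mean": mean,
        "std": std,
        "iqr": iqr,
        "skewness": skewness,
        "n_underperformers": len(underperformers),
    }


if __name__ == "__main__":
    # Simulated accuracies for 100 seeds; synthetic data, not from the study.
    rng = random.Random(0)
    simulated = [rng.gauss(0.92, 0.005) for _ in range(97)] + [0.70, 0.75, 0.80]
    print(summarize_runs(simulated, higher_is_better=True))
```

Two experiments with nearly identical means can differ sharply in `skewness` and `n_underperformers`, which is why a summary like this is reported alongside the mean rather than in place of it.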

Cite this study

Coakley et al. (2026) studied this question.

www.synapsesocial.com/papers/69fa98bd04f884e66b532802 — DOI: https://doi.org/10.1038/s41598-026-49656-z

Authors

Kevin L. Coakley

Odd Erik Gundersen

Journals

Scientific Reports
