Accurate, real-time estimation of core body temperature (CBT) during physical activity is essential for monitoring heat strain and mitigating the risk of heat-related illness under hot environmental conditions. Although numerous data-driven algorithms using wearable sensors have been proposed, their practical reliability remains unclear due to substantial methodological heterogeneity and the absence of standardized evaluation. This study combined a systematic review with a standardized quantitative benchmark. A total of 38 studies employing non-invasive inputs for CBT estimation were identified. Of these, 14 eligible models, including Kalman filter–based methods, statistical models, and machine-learning approaches, were re-implemented and evaluated under identical preprocessing and evaluation settings using two independent datasets: Dataset 1 (treadmill walking, ) and Dataset 2 (cycling, ). The benchmark revealed notable differences between originally reported performance and reproduced performance under standardized conditions. For the widely used heart-rate–based extended Kalman filter, the root mean square error (RMSE) increased from typically reported values of 0.21–0.41 °C to 0.41 °C on Dataset 1 and 0.66 °C on Dataset 2. Incorporating skin temperature improved tracking accuracy in some configurations, but performance gains were highly dependent on measurement site and dataset. Sensitivity for detecting elevated CBT (38.0 °C) varied markedly across methods, particularly for the cycling protocol. In conclusion, no single CBT estimation approach consistently outperformed others across all settings. Heart-rate–only models provided a stable baseline under limited sensing conditions, whereas multimodal approaches offered conditional benefits in more controlled scenarios. This work establishes a standardized benchmark framework to support fair comparison, method selection, and future development of (wearable) CBT estimation technologies.
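The two evaluation metrics named above, RMSE and sensitivity for detecting elevated CBT, can be illustrated with a minimal sketch. This is not the benchmark's implementation: the function names, the sample values, and the choice to treat a sample as elevated when it is at or above 38.0 °C are assumptions made for illustration only.

```python
import math


def rmse(reference, estimate):
    """Root mean square error between measured and estimated CBT (in °C)."""
    if len(reference) != len(estimate) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(r - e) ** 2 for r, e in zip(reference, estimate)]
    return math.sqrt(sum(squared) / len(squared))


def sensitivity(reference, estimate, threshold=38.0):
    """Fraction of truly elevated samples that the estimate also flags.

    Assumption: a sample counts as elevated when it is at or above
    `threshold`; the comparison direction is illustrative, not taken
    from the study.
    """
    elevated = [(r, e) for r, e in zip(reference, estimate) if r >= threshold]
    if not elevated:
        return float("nan")  # undefined when no reference sample is elevated
    detected = sum(1 for _, e in elevated if e >= threshold)
    return detected / len(elevated)


# Hypothetical values for illustration, not data from either dataset.
measured = [37.0, 37.5, 38.0, 38.5]
estimated = [37.0, 37.5, 37.5, 38.5]

print(rmse(measured, estimated))         # 0.25
print(sensitivity(measured, estimated))  # 0.5
```

In this toy case a small RMSE coexists with a sensitivity of only 0.5, which shows why the two metrics are worth reporting separately.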
• Systematic review of core temperature estimation during physical activity in hot environments.
• Standardized benchmark across two controlled heat-exposure datasets reveals strong dataset dependence.
• Results inform deployment of wearable heat-strain monitoring in occupational and built-environment settings.
Zhao et al. (Mon,) studied this question.