Cloud computing infrastructures operate millions of physical servers hosting vast numbers of virtual machines, delivering large-scale services at the cost of significant energy consumption. These high-performance computing activities generate substantial heat, necessitating robust cooling mechanisms to prevent system failures. However, traditional cooling systems further amplify energy usage, making thermal management a critical issue in data centers. Current workload allocation policies often overlook the thermal dynamics of server environments, limiting the ability to accurately simulate or predict temperature behavior across computing nodes. This gap results in inefficient energy use, increased operational costs, and a higher risk of thermally induced hardware degradation. To mitigate these issues, thermal-aware optimization frameworks are needed: systems that dynamically manage nodes by treating temperature as a decisive factor in scheduling and resource allocation. These frameworks promise to enhance energy efficiency, reduce cooling expenses, and extend hardware lifespan by controlling temperature-driven performance impacts. This review explores the key challenges in thermal management and assesses a variety of proposed solutions, including thermal-aware algorithms and control strategies. It evaluates their effectiveness in terms of energy savings, cost reduction, and improved system reliability across data centers.
Amahrouch et al. studied this question.