What question did this study set out to answer?

This research aims to develop a control method using multi-agent reinforcement learning for improved management of PVT and heat pump systems.

March 26, 2026 · Open Access

Development of an Optimal Multi-Agent Reinforcement Learning Control Method for an Integrated PVT–Heat Pump System

Key Points

The study applies multi-agent reinforcement learning, comparing Proximal Policy Optimization (PPO) with a Dueling DQN.
Three agents (PVT, AWHP, and fan-coil loop) are trained in a year-long co-simulation of a reference office building.
The reward minimizes tariff-weighted energy cost, with penalties for comfort and constraint violations.
PPO achieved the shortest payback period of 15.4 years and the lowest 20-year life-cycle cost
PPO sustained ∼16% electrical and 22.5% thermal efficiency under storage coupling
Interior flow policies reduced pumping: PVT −21%, FCU heating −32–35%, and cooling −18%.
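The reward structure named in the key points (tariff-weighted energy cost plus comfort and constraint penalties) can be illustrated with a minimal sketch. The function name, the comfort band, and the penalty weights below are all illustrative assumptions; the source does not give the actual reward formula or its coefficients.

```python
def step_reward(power_kw, tariff_per_kwh, room_temp_c, dt_s=60.0,
                comfort_band_c=(20.0, 26.0), comfort_weight=1.0,
                constraint_violations=0, constraint_weight=1.0):
    """Hypothetical per-step reward: negative cost plus penalties.

    The 60 s step matches the control interval stated in the abstract;
    the comfort band and weights are placeholders, not values from the study.
    """
    # Energy used during one control step, converted from kW to kWh.
    energy_kwh = power_kw * dt_s / 3600.0
    # Tariff weighting: the same energy costs more in high-tariff periods.
    energy_cost = tariff_per_kwh * energy_kwh
    # Comfort penalty grows with the distance outside the comfort band.
    low, high = comfort_band_c
    comfort_dev = max(low - room_temp_c, 0.0, room_temp_c - high)
    penalty = (comfort_weight * comfort_dev
               + constraint_weight * constraint_violations)
    # The agent maximizes reward, so cost and penalties enter negatively.
    return -(energy_cost + penalty)
```

Under these assumptions, a step inside the comfort band is charged only for energy, while a step outside it incurs an extra penalty proportional to the temperature deviation.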

Abstract

• PPO achieved shortest payback (15.4 yr) without comfort loss.
• Interior flow policies cut pumping: PVT −21%, FCU heat −32–35%, cool −18%.
• PPO sustained ∼16% electrical and 22.5% thermal efficiency under storage coupling.

Hybrid building systems that couple photovoltaic–thermal (PVT) collectors, an air-to-water heat pump (AWHP), and stratified storage can cut operating cost but are hard to control due to storage delays, seasonal nonstationarity, and subsystem coupling. We cast flow-rate control as a model-free multi-agent reinforcement learning (MARL) problem with centralized training and decentralized execution. Three agents (PVT, AWHP, fan-coil loop) act every 60 s in a year-long co-simulation of a reference office building. The reward minimizes tariff-weighted energy cost with comfort and constraint penalties; uniform safety bounds and slew-rate limits are applied. We evaluate Proximal Policy Optimization (PPO; continuous actions) and a discrete Dueling DQN against a supervised DNN and a rule-based controller. PPO learns smooth, storage-aware modulation that favors interior flow setpoints, preserves stratification, and reduces safety interventions. Over the full year, PPO delivers the best economics, achieving the shortest payback period (15.4 years) and the lowest 20-year life-cycle cost, outperforming Dueling DQN (16.0 years), the supervised DNN (17.3 years), and a conventional non-PVT/non-storage reference (17.7 years). Overall, PPO reduces tariff-weighted operating cost while maintaining comfort and constraint compliance, demonstrating quantitatively superior coordination of PVT charging, AWHP operation, and FCU draw in storage-coupled buildings.
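The safety bounds and slew-rate limits mentioned in the abstract can be pictured as a filter applied to each agent's requested flow setpoint before it reaches the plant. The sketch below is one common way to implement such a filter; the function name, argument names, and numeric values are assumptions for illustration and are not taken from the study.

```python
def apply_safety_limits(requested, previous, lower, upper, max_step):
    """Hypothetical action filter: slew-rate limit, then hard bounds.

    requested: flow setpoint proposed by the agent for this step
    previous:  flow setpoint applied in the previous step
    lower, upper: safety bounds on the setpoint (assumed values)
    max_step:  largest allowed change per control step (assumed value)
    """
    # Slew-rate limit: cap how far the setpoint may move in one step.
    delta = max(-max_step, min(max_step, requested - previous))
    limited = previous + delta
    # Safety bounds: keep the result inside the permitted range.
    return max(lower, min(upper, limited))
```

For example, with bounds of 0.0 to 0.8 and a maximum step of 0.1 (illustrative normalized flow units), a request to jump from 0.2 to 1.0 is passed on as roughly 0.3, and a request to move from 0.75 to 0.9 is held at the upper bound of 0.8.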

Authors

Soowon Chae

Yujin Nam

Journals

Energy and AI

Institutions

Pusan National University

Research Institute of Industrial Science and Technology
