What question did this study set out to answer?

The aim is to create a robust self-play mechanism to address complex routing problems with multiple constraints.

January 25, 2026 | Open Access

TSS GAZ PTP: Towards Improving Gumbel AlphaZero with Two-Stage Self-Play for Multi-Constrained Electric Vehicle Routing Problems

Key Points

The study introduces the Two-Stage Self-Play GAZ PTP (TSS GAZ PTP) framework.
In the first stage, the learning agent uses Gumbel Monte Carlo Tree Search (MCTS) against a greedy opponent that follows the best historical policy.
In the second stage, both agents use Gumbel MCTS, which keeps the competition balanced.
TSS GAZ PTP consistently outperforms existing deep reinforcement learning methods.
The gains are particularly notable on large-scale Electric Vehicle Routing Problem (EVRP) instances.
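
The two-stage opponent schedule described in the points above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the names `opponent_mode`, `self_play_outcome`, and `play_epoch` are hypothetical, the fixed `switch_epoch` threshold stands in for whatever transition rule the paper actually uses, the +1/-1 outcome with ties counted as losses is an assumed convention, and the tour builders are placeholders for a Gumbel MCTS rollout and a greedy rollout of the best historical policy.

```python
import math
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]
# A tour builder maps city coordinates to a visiting order (placeholder for a policy rollout).
TourBuilder = Callable[[Sequence[Point]], List[int]]


def tour_length(points: Sequence[Point], order: Sequence[int]) -> float:
    """Length of the closed tour that visits `points` in the given order."""
    n = len(order)
    return sum(math.dist(points[order[i]], points[order[(i + 1) % n]]) for i in range(n))


def opponent_mode(epoch: int, switch_epoch: int) -> str:
    """Stage 1 uses a greedy opponent, stage 2 a search-based one.

    Assumption: the switch happens at a fixed epoch; the source only says
    the transition occurs "as training progresses".
    """
    return "greedy" if epoch < switch_epoch else "search"


def self_play_outcome(points: Sequence[Point],
                      agent_build: TourBuilder,
                      opponent_build: TourBuilder) -> int:
    """Return +1 if the agent's tour is strictly shorter, else -1 (assumed convention)."""
    agent_len = tour_length(points, agent_build(points))
    opponent_len = tour_length(points, opponent_build(points))
    return 1 if agent_len < opponent_len else -1


def play_epoch(points: Sequence[Point], epoch: int, switch_epoch: int,
               search_build: TourBuilder, greedy_build: TourBuilder) -> Tuple[str, int]:
    """Pick the opponent for this epoch and score one self-play episode.

    The learning agent always uses the search-based builder; only the
    opponent changes between the two stages.
    """
    mode = opponent_mode(epoch, switch_epoch)
    opponent = greedy_build if mode == "greedy" else search_build
    return mode, self_play_outcome(points, search_build, opponent)


if __name__ == "__main__":
    square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
    perimeter = lambda pts: [0, 1, 2, 3]   # stand-in for a search-guided tour
    crossing = lambda pts: [0, 2, 1, 3]    # stand-in for a weaker greedy tour
    for epoch in (0, 10):
        print(epoch, play_epoch(square, epoch, switch_epoch=5,
                                search_build=perimeter, greedy_build=crossing))
```

In a real training loop the outcome would feed the value target for the learning agent; here it is only returned so the stage switch is easy to inspect.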

Abstract

Deep reinforcement learning (DRL) with self-play has emerged as a promising paradigm for solving combinatorial optimization (CO) problems. The recently proposed Gumbel AlphaZero Plan-to-Play (GAZ PTP) framework adopts a competitive training setup between a learning agent and an opponent to tackle classical CO tasks such as the Traveling Salesman Problem (TSP). However, in complex and multi-constrained environments like the Electric Vehicle Routing Problem (EVRP), standard self-play often suffers from opponent mismatch: when the opponent is either too weak or too strong, the resulting learning signal becomes ineffective. To address this challenge, we introduce Two-Stage Self-Play GAZ PTP (TSS GAZ PTP), a novel DRL method designed to maintain adaptive and effective learning pressure throughout the training process. In the first stage, the learning agent, guided by Gumbel Monte Carlo Tree Search (MCTS), competes against a greedy opponent that follows the best historical policy. As training progresses, the framework transitions to a second stage in which both agents employ Gumbel MCTS, thereby establishing a dynamically balanced competitive environment that encourages continuous strategy refinement. The primary objective of this work is to develop a robust self-play mechanism capable of handling the high-dimensional constraints inherent in real-world routing problems. We first validate our approach on the TSP, a benchmark used in the original GAZ PTP study, and then extend it to the multi-constrained EVRP, which incorporates practical limitations including battery capacity, time windows, vehicle load limits, and charging infrastructure availability. The experimental results show that TSS GAZ PTP consistently outperforms existing DRL methods, with particularly notable improvements on large-scale instances.
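
To make the EVRP constraints named in the abstract concrete (battery capacity, time windows, vehicle load limits, and charging stations), the following is a minimal feasibility check for a single route. It is a sketch under simplifying assumptions that do not come from the paper: energy use is linear in Euclidean distance, travel speed is constant, a charging station restores the battery fully in a fixed time, and early arrivals wait until the time window opens. The `Node` fields and the `route_is_feasible` function are hypothetical names chosen for illustration.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Node:
    xy: Tuple[float, float]
    demand: float = 0.0        # load picked up or delivered at this node
    ready: float = 0.0         # earliest allowed service start
    due: float = math.inf      # latest allowed service start
    is_charger: bool = False   # True if the node is a charging station


def route_is_feasible(route: List[Node], load_cap: float, battery_cap: float,
                      energy_per_dist: float = 1.0, speed: float = 1.0,
                      recharge_time: float = 0.0) -> bool:
    """Check load, battery, and time-window constraints along one route.

    `route` is expected to start and end at the depot. All consumption and
    charging rules here are assumptions made for illustration.
    """
    # Load limit: total demand on the route must fit in the vehicle.
    if sum(node.demand for node in route) > load_cap:
        return False

    battery, clock = battery_cap, 0.0
    for prev, node in zip(route, route[1:]):
        dist = math.dist(prev.xy, node.xy)

        # Battery capacity: the vehicle must reach the next node with charge >= 0.
        battery -= energy_per_dist * dist
        if battery < 0:
            return False

        # Time window: wait if early, fail if the window has already closed.
        clock = max(clock + dist / speed, node.ready)
        if clock > node.due:
            return False

        # Charging station: assumed full recharge in a fixed amount of time.
        if node.is_charger:
            battery = battery_cap
            clock += recharge_time
    return True
```

A constructive policy could call a check like this to mask infeasible next moves, for example rejecting a customer that cannot be reached without first detouring to a charging station.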

Cite This Study

Wang et al. studied this question.

synapsesocial.com/papers/6975b28afeba4585c2d6e0dd
https://doi.org/10.3390/smartcities9020021
