Deep reinforcement learning (DRL) with self-play has emerged as a promising paradigm for solving combinatorial optimization (CO) problems. The recently proposed Gumbel AlphaZero Plan-to-Play (GAZ PTP) framework adopts a competitive training setup between a learning agent and an opponent to tackle classical CO tasks such as the Traveling Salesman Problem (TSP). However, in complex and multi-constrained environments like the Electric Vehicle Routing Problem (EVRP), standard self-play often suffers from opponent mismatch: when the opponent is either too weak or too strong, the resulting learning signal becomes ineffective. To address this challenge, we introduce Two-Stage Self-Play GAZ PTP (TSS GAZ PTP), a novel DRL method designed to maintain adaptive and effective learning pressure throughout the training process. In the first stage, the learning agent, guided by Gumbel Monte Carlo Tree Search (MCTS), competes against a greedy opponent that follows the best historical policy. As training progresses, the framework transitions to a second stage in which both agents employ Gumbel MCTS, thereby establishing a dynamically balanced competitive environment that encourages continuous strategy refinement. The primary objective of this work is to develop a robust self-play mechanism capable of handling the high-dimensional constraints inherent in real-world routing problems. We first validate our approach on the TSP, a benchmark used in the original GAZ PTP study, and then extend it to the multi-constrained EVRP, which incorporates practical limitations including battery capacity, time windows, vehicle load limits, and charging infrastructure availability. The experimental results show that TSS GAZ PTP consistently outperforms existing DRL methods, with particularly notable improvements on large-scale instances.
Wang et al. (Fri,) studied this question.