The multi-depot vehicle routing problem with soft time windows (MDVRPSTW) has long attracted attention in both academia and industry. This paper proposes a deep reinforcement learning framework that improves the efficiency and quality of MDVRPSTW solutions and addresses the limitations of traditional heuristic algorithms in large-scale, complex scenarios. The framework first recasts the mathematical model as a sequential decision-making problem via a Markov decision process, then learns path selection strategies with an encoder–decoder architecture based on attention mechanisms and graph neural networks, and trains the model with unsupervised reinforcement learning. Tests on the Solomon benchmark dataset show that for small-scale problems (N = 20), our method reduces solving time by over 96% relative to the compared algorithms, while its objective value differs from that of generalized variable neighborhood search (GVNS) by less than 9%. For medium- and large-scale problems (N = 50 and N = 100), our method achieves a 27.7% to 96.3% improvement over GVNS while keeping solution times stable between 3 and 10 s. Compared with exact algorithms and metaheuristic methods, our approach reduces computational cost by 2–3 orders of magnitude and adapts well to variations in the number of depots and vehicles. In summary, the method significantly outperforms the baseline models in both solution quality and computational efficiency, providing an efficient end-to-end solution for MDVRPSTW in complex scenarios.
Chen et al. studied this question.
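To make the soft-time-window objective concrete, the following is a minimal sketch of how the cost of a single route might be evaluated, and how its negative could serve as an episode reward in the Markov decision process. It is not the paper's model: the linear early/late penalties, the weights `alpha` and `beta`, the unit travel speed, the rule that service starts on arrival, and the function names `route_cost` and `episode_reward` are all assumptions made for illustration.

```python
import math


def route_cost(route, coords, windows, service, alpha=0.5, beta=1.0):
    """Travel distance plus linear soft-time-window penalties for one route.

    route   : list of node ids, starting and ending at a depot
    coords  : dict mapping node id -> (x, y)
    windows : dict mapping node id -> (earliest, latest); depots may be omitted
    service : dict mapping node id -> service duration
    alpha, beta : hypothetical penalty weights for early / late arrival
    """
    time = 0.0
    distance = 0.0
    penalty = 0.0
    for prev, node in zip(route, route[1:]):
        leg = math.dist(coords[prev], coords[node])
        distance += leg
        time += leg  # assumption: unit speed, so travel time equals distance
        if node in windows:
            earliest, latest = windows[node]
            # Soft window: a violation is penalized rather than forbidden.
            penalty += alpha * max(0.0, earliest - time)
            penalty += beta * max(0.0, time - latest)
        # Assumption: service starts on arrival, with no waiting when early.
        time += service.get(node, 0.0)
    return distance + penalty


def episode_reward(routes, coords, windows, service):
    """Negative total cost over all routes, usable as an RL episode reward."""
    return -sum(route_cost(r, coords, windows, service) for r in routes)
```

For example, with a depot `0` at (0, 0), customers `1` at (3, 4) and `2` at (3, 0), windows `{1: (6, 10), 2: (0, 8)}`, and unit service times, the route `[0, 1, 2, 0]` travels a distance of 12, arrives one time unit early at customer 1 and two units late at customer 2, and so costs 12 + 0.5 + 2.0 = 14.5 under the default weights.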