This paper addresses the uncertainty that dynamic environments introduce into autonomous path planning for unmanned aerial vehicles (UAVs). Existing deep reinforcement learning (DRL) methods struggle to represent multiple feasible trajectories and often neglect failure experiences. To overcome these limitations, we propose a novel framework termed GRPO-MDP (Group Relative Policy Optimization with a Multimodal Diffusion Strategy). The framework uses a Denoising Diffusion Probabilistic Model (DDPM) to generate diverse action samples during training. By integrating DRL with a multimodal diffusion strategy, GRPO-MDP lets a UAV explore multiple feasible trajectories from the same state, which makes exploration more flexible. The Group Relative Policy Optimization (GRPO) component constructs advantage functions through relative comparisons within groups of trajectories, which eliminates the need for a value network and improves training stability. In addition, a hindsight trajectory relabeling mechanism converts failure experiences into informative learning signals by incorporating virtual target and safety boundary modes. For real-time safety, a Control Barrier Function (CBF) acts as a safety filter so that hard safety constraints are enforced during path planning. Experimental results show that GRPO-MDP outperforms existing methods in success rate, path efficiency, and safety in dynamic environments.
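As an illustration of the group-relative advantage idea, the sketch below normalizes each trajectory's return against the mean and standard deviation of its own group, so no learned value network is needed as a baseline. This is the commonly used GRPO normalization and is an assumption here: the abstract does not state the exact formula used in GRPO-MDP. The function name `group_relative_advantages`, the `eps` term, and the sample returns are hypothetical.

```python
from statistics import fmean, pstdev


def group_relative_advantages(returns, eps=1e-8):
    """Score each trajectory relative to the other trajectories in its group.

    Assumed form: A_i = (R_i - mean(R)) / (std(R) + eps).
    The group mean plays the role a value network would otherwise play.
    """
    if not returns:
        raise ValueError("group must contain at least one trajectory")
    mean = fmean(returns)
    std = pstdev(returns)  # population standard deviation over the group
    # eps avoids division by zero when all returns in the group are equal
    return [(r - mean) / (std + eps) for r in returns]


# Hypothetical returns of four trajectories sampled from the same start state
group_returns = [12.0, 7.5, -3.0, 9.0]
advantages = group_relative_advantages(group_returns)
```

Trajectories that beat the group average receive positive advantages and the rest receive negative ones. A group with identical returns yields all-zero advantages and therefore contributes no gradient signal.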
Liu et al. (Wed,) studied this question.