Off-policy reinforcement learning is usually used to train the grasping task model of the manipulator. However, in the training process, it is difficult to collect enough successful experience data and rewards for learning and training; that is, there is a problem of sparse rewards. Hindsight experience replay (HER) allows the agent to relabel the completed states. However, not all failed experiences have the same effect on learning and training. Facing the many transitions generated by the environment during operation, adopting a random uniform sampling method from the experience replay buffer will result in low data utilization and slow convergence. This paper proposes using a prioritized sampling method to sample the relabelled transitions, and then combines various off-policy reinforcement learning algorithms with it for training in simulated environments. This paper uses the prioritized sampling method, which allows the agent to access more important transitions earlier and accelerate the convergence of training. The results demonstrate that hindsight experience replay with prioritization (PHER) exhibits significantly faster convergence compared to other methods.
Zhang et al. (Thu,) studied this question.