Stragglers remain an open challenge in large model training, forcing other GPUs to wait at synchronization barriers. While severe, persistent stragglers have been actively studied, their transient counterparts are often overlooked because they are short-lived. We reveal that the interplay between transient straggler effects and network contention drastically amplifies the relative delays across workers, significantly slowing down training iterations. An intuitive remedy is to prioritize bandwidth for the straggler so that it can catch up with the front-runners. However, although the transport layer is well positioned to provide such responsive bandwidth control, conventional designs lack visibility into process-level progress. To bridge this gap, we propose PRC (Process-centric Rate Control), a new sending-rate control scheme designed to mitigate transient straggler effects. PRC adjusts NIC sending rates by inferring local GPU process-level information, enabling transient straggler processes to obtain more bandwidth in a timely manner. Extensive experiments on both a real-world cluster and large-scale simulations confirm that PRC effectively accelerates transient stragglers, achieving a training speedup of up to 28% over state-of-the-art datacenter congestion control schemes.
Han et al. (Mon,) studied this question.