Deep learning models are growing rapidly in scale, and hundreds of terabytes of data are now used to train them. Heterogeneous computing systems have become the dominant platform for parallel training. Nodes in such systems differ significantly in computational capability, and resource utilization depends predominantly on task granularity, communication overhead, and model complexity. In these environments, traditional static or heuristic-based scheduling methods may not allocate tasks or balance load efficiently. To address these difficulties, this paper presents JCOHSS, an adaptive scheduling strategy for parallel deep neural networks in heterogeneous computing environments. By continuously analyzing node states and task characteristics, JCOHSS builds a dynamically tunable scheduling decision model that performs scalable collaborative scheduling across heterogeneous computational resources. The paper first studies the bottlenecks of a parallel deep learning training job in a heterogeneous environment and identifies the scheduling challenges from three aspects: resource capacity, task demand, and the limitations of previous strategies. It then devises an integrated scheduling framework and an adaptive decision algorithm that combine a performance prediction model with a dynamic feedback mechanism to improve scheduling accuracy and stability. Finally, experiments on representative deep learning models and diverse systems demonstrate the efficacy of the proposed approach. The experimental results show that the strategy can notably reduce training time, improve resource utilization, and remain robust across various heterogeneous architectures. This research offers a scalable scheduling scheme for improving deep learning training efficiency in heterogeneous computing environments, which paves the way for future cross-platform collaborative computing and intelligent training systems.
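The abstract does not give the details of the JCOHSS algorithm, so the following is only a minimal sketch of the general idea of pairing a performance prediction model with a feedback loop. It is not the authors' method. The names `Node`, `AdaptiveScheduler`, `est_throughput`, and `alpha` are all hypothetical, and the predictor (work divided by estimated throughput, corrected by an exponential moving average) is an assumption chosen for brevity.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    est_throughput: float    # predicted work units per second (hypothetical metric)
    busy_until: float = 0.0  # predicted time at which the node becomes free


class AdaptiveScheduler:
    """Illustrative scheduler: predict finish times, then correct from feedback."""

    def __init__(self, nodes, alpha=0.5):
        self.nodes = {n.name: n for n in nodes}
        self.alpha = alpha  # weight given to the newest observation (assumed value)

    def predict_duration(self, node, work):
        # Performance prediction: a simple throughput model, assumed for this sketch.
        return work / node.est_throughput

    def assign(self, work):
        # Pick the node with the earliest predicted finish time for this task.
        best = min(
            self.nodes.values(),
            key=lambda n: n.busy_until + self.predict_duration(n, work),
        )
        best.busy_until += self.predict_duration(best, work)
        return best.name

    def feedback(self, name, work, observed_seconds):
        # Dynamic feedback: blend the observed throughput into the estimate
        # with an exponential moving average.
        node = self.nodes[name]
        observed = work / observed_seconds
        node.est_throughput = (
            (1 - self.alpha) * node.est_throughput + self.alpha * observed
        )
```

In this sketch, a node that turns out slower than predicted receives a lower throughput estimate after `feedback` is called, so later calls to `assign` shift work toward other nodes. A real system would also need to account for the communication overhead and task granularity that the abstract mentions.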
Hong Nie
Junwei Sun
Yirui Xu
IET conference proceedings.
Luye Pharma (China)
Nie et al. studied this question.
synapsesocial.com/papers/69ccb62016edfba7beb87be9 — DOI: https://doi.org/10.1049/icp.2026.0372