What question did this study set out to answer?

The aim is to create a GPU cluster scheduler that learns from data without relying on fixed rules, addressing limitations of existing methods.

April 10, 2026 · Open Access

LLM-based GPU Cluster Scheduler

Key Points

Goal: a GPU cluster scheduler that learns its policy from data rather than fixed rules, addressing the limitations of rule-based methods.
The method uses LlamaRec's LLMRanker (Llama-2-7B with LoRA-based PEFT) as the Job Sorter.
The model is trained on SJF policy data generated by a simulator built on Alibaba GPU job traces.
Adapting to a changed environment requires only retraining with added or replaced data, not redesigning rules.
LLMRanker achieved high ranking-prediction performance on the NDCG and Kendall's tau metrics.
In GPU cluster simulation, it matched SJF on job completion time, waiting time, makespan, and GPU utilization.
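
As a rough illustration of what "SJF policy data" could look like, the sketch below pairs a snapshot of the job queue with the order SJF would pick. The `Job` fields, the tie-breaking rule, and the `make_training_example` helper are all hypothetical; the paper's simulator and data format are not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    job_id: str
    arrival: int   # submission time, arbitrary units (assumed field)
    duration: int  # estimated run time, arbitrary units (assumed field)
    gpus: int      # GPUs requested (assumed field, unused by SJF here)


def fifo_order(queue):
    """First-In First-Out: earliest arrival first."""
    return sorted(queue, key=lambda j: j.arrival)


def sjf_order(queue):
    """Shortest Job First: shortest duration first, ties broken by arrival."""
    return sorted(queue, key=lambda j: (j.duration, j.arrival))


def make_training_example(queue):
    """Pair the candidate jobs with the SJF ranking as the learning target."""
    return {
        "candidates": [j.job_id for j in queue],
        "target_ranking": [j.job_id for j in sjf_order(queue)],
    }


queue = [Job("a", 0, 50, 1), Job("b", 1, 10, 2), Job("c", 2, 30, 1)]
example = make_training_example(queue)
print(example["target_ranking"])
```

Under these assumptions, swapping `sjf_order` for another policy function changes the training targets without touching the ranker, which is the kind of data-driven retraining the key points describe.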

Abstract

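Rule-based GPU schedulers such as First-In First-Out (FIFO) and Shortest Job First (SJF) depend on experts' domain knowledge and struggle to respond flexibly to complex job compositions or atypical changes in the system environment. To overcome these limitations, this study proposes a Large Language Model (LLM)-based GPU cluster scheduler that learns a policy from scheduling data without explicit rule design. The proposed method uses LlamaRec's LLMRanker (Llama-2-7B, LoRA-based PEFT) as the Job Sorter and trains it on SJF policy data generated by a simulator based on Alibaba GPU job traces. Because no rules need to be redesigned, the approach can respond to environmental changes solely through retraining with added or replaced data, and performance can be expected to improve with the quality and quantity of the data. In the experiments, LLMRanker showed high ranking-prediction performance on the Normalized Discounted Cumulative Gain (NDCG) and Kendall's tau metrics, and in GPU cluster simulation it achieved results on par with SJF in Job Completion Time (JCT), waiting time, makespan, and GPU utilization.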
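
The two ranking metrics named in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code: it assumes both rankings contain the same items (at least two, no ties), uses the tau-a variant of Kendall's tau, and derives NDCG relevance from the reference rank (an item at reference position i out of n gets relevance n - i), which is an assumption made here for the example.

```python
import math
from itertools import combinations


def kendall_tau(pred, ref):
    """Kendall's tau-a between two orderings of the same items (no ties)."""
    pos_pred = {item: i for i, item in enumerate(pred)}
    pos_ref = {item: i for i, item in enumerate(ref)}
    concordant = discordant = 0
    for x, y in combinations(ref, 2):
        # The pair is concordant when both orderings agree on who comes first.
        agree = (pos_pred[x] - pos_pred[y]) * (pos_ref[x] - pos_ref[y])
        if agree > 0:
            concordant += 1
        else:
            discordant += 1
    n_pairs = len(ref) * (len(ref) - 1) // 2
    return (concordant - discordant) / n_pairs


def ndcg(pred, ref):
    """NDCG of `pred`, with relevance taken from positions in `ref` (assumed)."""
    n = len(ref)
    relevance = {item: n - i for i, item in enumerate(ref)}

    def dcg(order):
        return sum(relevance[item] / math.log2(i + 2)
                   for i, item in enumerate(order))

    # `ref` is the ideal ordering, so its DCG is the normalizer.
    return dcg(pred) / dcg(ref)


sjf_ranking = ["b", "c", "a"]    # reference order, e.g. from an SJF policy
model_ranking = ["b", "a", "c"]  # hypothetical model output
tau = kendall_tau(model_ranking, sjf_ranking)
score = ndcg(model_ranking, sjf_ranking)
print(f"tau={tau:.3f} ndcg={score:.3f}")
```

Both metrics equal 1.0 when the predicted order matches the reference exactly. Kendall's tau penalizes every swapped pair equally, while NDCG weights mistakes near the top of the ranking more heavily.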

Cite This Study

Hwang et al. studied this question.

synapsesocial.com/papers/69d892d16c1944d70ce04052 https://doi.org/10.6109/jkiice.2026.30.3.453