In modern multi-accelerator nodes, GPU throughput is increasingly constrained by storage and I/O bottlenecks, leaving accelerators idle while data transfer is throttled in software. In this study, we focus on single-node, multi-GPU systems running small-to-medium-scale models and present a systematic, phase-aware evaluation of datapath performance for Large Language Models (LLMs), covering in-kernel libaio, hybrid user-kernel io_uring, user-space NVMe via the Storage Performance Development Kit (SPDK), and GPUDirect Storage (GDS). These approaches are evaluated across storage media including SATA Solid State Drives (SSDs), NVMe SSDs, Optane NVMe, and Optane Persistent Memory (PMem). Leveraging an automated evaluation framework, we explore over 25,000 configurations, measuring throughput, latency, I/O operations per second (IOPS), and CPU cost. Our study covers LLM storage scenarios drawn from both standardized benchmarks and real-world production traces, ensuring that our workload models accurately reflect the I/O demands of pre-training, fine-tuning, and inference. We find that for inference, io_uring achieves the lowest latency and competitive IOPS for small random I/O on NVMe, whereas SPDK is limited to raw block-device evaluation due to its lack of POSIX file-system support. Pre-training and fine-tuning workloads are dominated by coarse-grained sequential reads and writes, where GDS excels at reducing load times and host CPU usage; among CPU-mediated datapaths, CPU efficiency (measured as GB/s per core) emerges as the key differentiator. Taken together, these results yield actionable design guidelines: align the choice of datapath with the LLM pipeline phase, using io_uring for inference to minimize latency and maximize transfer efficiency, and GDS for pre-training and fine-tuning to improve throughput per core, thereby narrowing the storage-to-compute gap in GPU LLM clusters.
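The CPU-efficiency metric the abstract uses to compare CPU-mediated datapaths (GB/s per core) can be made concrete. The sketch below is illustrative only; the function name and its inputs are assumptions, not taken from the paper:

```python
def cpu_efficiency_gbps_per_core(bytes_transferred: int,
                                 elapsed_s: float,
                                 cpu_core_seconds: float) -> float:
    """Throughput delivered per unit of host-CPU work, in GB/s per busy core.

    bytes_transferred: total bytes moved during the measurement window
    elapsed_s:         wall-clock duration of the window, in seconds
    cpu_core_seconds:  total CPU time consumed (user + system), summed
                       over all cores, during the same window
    """
    throughput_gbps = bytes_transferred / elapsed_s / 1e9  # GB/s
    avg_cores_busy = cpu_core_seconds / elapsed_s          # average busy cores
    return throughput_gbps / avg_cores_busy


# Example: 10 GB moved in 10 s while consuming 5 core-seconds of CPU
# yields 1 GB/s at 0.5 cores busy on average, i.e. 2 GB/s per core.
print(cpu_efficiency_gbps_per_core(10_000_000_000, 10.0, 5.0))  # → 2.0
```

Under this definition a datapath that offloads transfers (e.g. GDS) scores higher not by moving more bytes but by consuming fewer host-CPU core-seconds for the same traffic.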
Ali Sedaghatgoo
Reza Salkhordeh
André Brinkmann
Proceedings of the ACM on Measurement and Analysis of Computing Systems
Johannes Gutenberg University Mainz
Saarland University
Sharif University of Technology
DOI: https://doi.org/10.1145/3788106