In modern multi-accelerator nodes, GPU throughput is increasingly constrained by storage and I/O bottlenecks, leaving accelerators idle while data transfer is throttled in software. In this study, we focus on single-node, multi-GPU systems running small-to-medium-scale models and present a systematic, phase-aware evaluation of datapath performance for Large Language Models (LLMs), covering in-kernel libaio, hybrid user-kernel io_uring, user-space NVMe via the Storage Performance Development Kit (SPDK), and GPUDirect Storage (GDS). These approaches are evaluated across storage media including SATA Solid State Drives (SSDs), NVMe SSDs, Optane NVMe, and Optane Persistent Memory (PMem). Leveraging an automated evaluation framework, we explore over 25,000 configurations, measuring throughput, latency, I/O operations per second (IOPS), and CPU cost. Our study covers LLM storage scenarios drawn from both standardized benchmarks and real-world production traces, ensuring that our workload models accurately reflect the I/O demands of pre-training, fine-tuning, and inference. We find that for inference, io_uring achieves the lowest latency and competitive IOPS for small random I/O on NVMe, whereas SPDK is limited to raw block-device evaluation due to its lack of POSIX file-system support. Pre-training and fine-tuning workloads are dominated by coarse-grained sequential reads and writes, where GDS excels at reducing load times and host CPU usage; among CPU-mediated datapaths, CPU efficiency (measured as GB/s per core) emerges as the key differentiator. Taken together, these results yield actionable design guidelines: align the choice of datapath with the LLM pipeline phase, using io_uring for inference to minimize latency and maximize transfer efficiency, and GDS for pre-training and fine-tuning to improve throughput per core, thereby narrowing the storage-to-compute gap in GPU LLM clusters.
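The CPU-efficiency metric the abstract uses to compare CPU-mediated datapaths (GB/s per core) can be made concrete. The sketch below is illustrative only; the function name and its inputs are assumptions, not taken from the paper:

```python
def cpu_efficiency_gbps_per_core(bytes_transferred: int,
                                 elapsed_s: float,
                                 cpu_core_seconds: float) -> float:
    """Throughput delivered per unit of host-CPU work, in GB/s per busy core.

    bytes_transferred: total bytes moved during the measurement window
    elapsed_s:         wall-clock duration of the window, in seconds
    cpu_core_seconds:  total CPU time consumed (user + system), summed
                       over all cores, during the same window
    """
    throughput_gbps = bytes_transferred / elapsed_s / 1e9  # GB/s
    avg_cores_busy = cpu_core_seconds / elapsed_s          # average busy cores
    return throughput_gbps / avg_cores_busy


# Example: 10 GB moved in 10 s while consuming 5 core-seconds of CPU
# yields 1 GB/s at 0.5 cores busy on average, i.e. 2 GB/s per core.
print(cpu_efficiency_gbps_per_core(10_000_000_000, 10.0, 5.0))  # → 2.0
```

Under this definition a datapath that offloads transfers (e.g. GDS) scores higher not by moving more bytes but by consuming fewer host-CPU core-seconds for the same traffic.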
Ali Sedaghatgoo
Reza Salkhordeh
André Brinkmann
Proceedings of the ACM on Measurement and Analysis of Computing Systems
Johannes Gutenberg University Mainz
Saarland University
Sharif University of Technology
DOI: https://doi.org/10.1145/3788106