What type of study is this?

This is an experimental study.

October 1, 2025 | Open Access

Optimizing SLO-oriented LLM Serving with PD-Multiplexing

Key Points

Drift achieves an average 5.1× throughput improvement while meeting strict SLO targets.
The framework uses low-level GPU partitioning techniques to multiplex the prefill and decode phases spatially and adaptively on shared GPUs.
An adaptive gang scheduling mechanism and an SLO-aware dispatching policy let Drift exploit this multiplexing across complex multi-turn workflows.
Drift resolves trade-offs in LLM serving by enabling effective in-place memory sharing and compute partitioning.

Abstract

Modern LLM services demand high throughput and stringent SLO guarantees across two distinct inference phases (prefill and decode) and complex multi-turn workflows. However, current systems face a fundamental tradeoff: out-of-place compute partition enables per-phase SLO attainment, while in-place memory sharing maximizes throughput via KV cache reuse. Moreover, existing in-place compute partition suffers from low utilization and high overhead because of its phase-coupling design. We present Drift, a new LLM serving framework that resolves this tension via PD multiplexing, enabling in-place and phase-decoupled compute partition. Drift leverages low-level GPU partitioning techniques to multiplex prefill and decode phases spatially and adaptively on shared GPUs, while preserving in-place memory sharing. To fully leverage the multiplexing capability, Drift introduces an adaptive gang scheduling mechanism, a contention-free modeling method, and an SLO-aware dispatching policy. Evaluation shows that Drift achieves an average 5.1× throughput improvement (up to 17.5×) over state-of-the-art baselines, while consistently meeting SLO targets under complex LLM workloads.
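
To make the idea of splitting one GPU between the two phases under per-phase SLOs more concrete, here is a minimal Python sketch. It is not Drift's algorithm: the function name choose_partition, the Partition type, and the linear "latency = work / SMs" model are all assumptions made purely for illustration. The sketch gives the decode phase the fewest streaming multiprocessors (SMs) that meet a time-per-output-token target, hands the rest to prefill, and rejects the split if prefill would miss its time-to-first-token target.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Partition:
    """Hypothetical split of one GPU's SMs between the two phases."""
    decode_sms: int
    prefill_sms: int


def choose_partition(
    total_sms: int,
    decode_work_ms: float,   # assumed decode step time on a single SM
    prefill_work_ms: float,  # assumed prefill time on a single SM
    tpot_slo_ms: float,      # time-per-output-token target
    ttft_slo_ms: float,      # time-to-first-token target
) -> Optional[Partition]:
    """Toy model (an assumption, not the paper's): latency = work / SMs.

    Returns None when no split satisfies both targets.
    """
    # Smallest SM count that keeps a decode step within its target.
    decode_sms = max(1, math.ceil(decode_work_ms / tpot_slo_ms))
    prefill_sms = total_sms - decode_sms
    if prefill_sms < 1:
        return None
    # Prefill runs on whatever is left; check its target too.
    if prefill_work_ms / prefill_sms > ttft_slo_ms:
        return None
    return Partition(decode_sms=decode_sms, prefill_sms=prefill_sms)
```

For example, under these made-up numbers, choose_partition(100, 1000.0, 20000.0, 50.0, 400.0) assigns 20 SMs to decode and 80 to prefill, while tightening the time-to-first-token target to 200 ms makes the same call return None. A real system would replace the linear model with measured, contention-aware latency profiles.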

Cite this study

Cui et al. studied this question.

www.synapsesocial.com/papers/68dd91c7fe798ba2fc49832c — DOI: https://doi.org/10.48550/arxiv.2504.14489

Authors

Wenwen Cui

Y. Chen

Han Zhao
