Modern LLM services demand high throughput and stringent SLO guarantees across two distinct inference phases (prefill and decode) and complex multi-turn workflows. However, current systems face a fundamental tradeoff: out-of-place compute partition enables per-phase SLO attainment, while in-place memory sharing maximizes throughput via KV cache reuse. Moreover, existing in-place compute partition suffers from low utilization and high overhead due to its phase-coupled design. We present Drift, a new LLM serving framework that resolves this tension via PD multiplexing, enabling in-place and phase-decoupled compute partition. Drift leverages low-level GPU partitioning techniques to multiplex the prefill and decode phases spatially and adaptively on shared GPUs, while preserving in-place memory sharing. To fully leverage the multiplexing capability, Drift introduces an adaptive gang scheduling mechanism, a contention-free modeling method, and an SLO-aware dispatching policy. Evaluation shows that Drift achieves an average 5.1× throughput improvement (up to 17.5×) over state-of-the-art baselines, while consistently meeting SLO targets under complex LLM workloads.
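To make the idea of SLO-aware spatial partitioning concrete, the following is a minimal sketch and not Drift's actual algorithm. It assumes a toy latency model in which a phase's latency scales inversely with the number of streaming multiprocessors (SMs) it is given, and it picks the smallest decode partition that still meets a per-token latency SLO, leaving the remaining SMs to prefill. The function names, the latency model, and all numbers are hypothetical.

```python
from typing import Optional, Tuple


def predicted_latency_ms(full_gpu_latency_ms: float, sms: int, total_sms: int) -> float:
    """Toy model (an assumption): latency scales inversely with the SM share."""
    if not 0 < sms <= total_sms:
        raise ValueError("sms must be in (0, total_sms]")
    return full_gpu_latency_ms * total_sms / sms


def plan_partition(
    decode_full_gpu_ms: float,
    decode_slo_ms: float,
    total_sms: int,
    step: int,
) -> Optional[Tuple[int, int]]:
    """Return (decode_sms, prefill_sms), or None if no split meets the SLO.

    Scans decode partitions from smallest to largest, so prefill keeps as
    many SMs as possible while decode still meets its latency target.
    """
    for decode_sms in range(step, total_sms, step):
        if predicted_latency_ms(decode_full_gpu_ms, decode_sms, total_sms) <= decode_slo_ms:
            return decode_sms, total_sms - decode_sms
    return None


if __name__ == "__main__":
    # Hypothetical numbers: a decode step takes 10 ms on the full GPU,
    # and the per-token SLO is 50 ms.
    print(plan_partition(10.0, 50.0, total_sms=100, step=10))
```

A real system would replace the inverse-scaling model with measured or learned per-phase latency profiles, and would re-run the decision as the batch composition changes.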
Cui et al. studied this question.
www.synapsesocial.com/papers/68dd91c7fe798ba2fc49832c — DOI: https://doi.org/10.48550/arxiv.2504.14489
Wenwen Cui
Y. Chen
Han Zhao