What type of study is this?

This is an experimental study.

October 1, 2025 · Open Access

SeaLLM: Service-Aware and Latency-Optimized Resource Sharing for Large Language Model Inference

Key Points

SeaLLM reduces normalized latency by up to 13.60 times compared to existing solutions, enhancing user experience.
The tail latency improves significantly, with a reduction of up to 18.69 times, indicating better handling of peak requests.
A unified key-value cache efficiently shares GPU memory among large language model services, optimizing resource usage.
Evaluation showed substantial gains in service-level objective attainment, improving by up to 3.64 times, thereby ensuring reliable performance.

Abstract

Large language models (LLMs) with different architectures and sizes have been developed. Serving each LLM with dedicated GPUs leads to resource waste and service inefficiency due to the varying demand of LLM requests. A common practice is to share multiple LLMs. However, existing sharing systems either do not consider the autoregressive pattern of LLM services, or only focus on improving the throughput, which impairs the sharing performance, especially the serving latency. We present SeaLLM, which enables service-aware and latency-optimized LLM sharing. SeaLLM improves the overall sharing performance by (1) a latency-optimized scheduling algorithm utilizing the characteristics of LLM services, (2) a placement algorithm to determine the placement plan and an adaptive replacement algorithm to decide the replacement interval, and (3) a unified key-value cache to share GPU memory among LLM services efficiently. Our evaluation under real-world traces and LLM services demonstrates that SeaLLM improves the normalized latency by up to 13.60×, the tail latency by up to 18.69×, and the SLO attainment by up to 3.64× compared to existing solutions.
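
The three metrics named in the abstract can be illustrated with a minimal Python sketch. It is not taken from the paper. It assumes that normalized latency means end-to-end latency divided by the number of output tokens (a convention common in LLM serving work), that tail latency is a high percentile of end-to-end latency (99th by default here), and that SLO attainment is the fraction of requests finishing within a latency target. The Request type and all function names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """One served request (hypothetical record type for this sketch)."""
    latency_s: float    # end-to-end latency in seconds
    output_tokens: int  # number of generated tokens


def mean_normalized_latency(reqs):
    # Assumed definition: per-request latency divided by output length, averaged.
    return sum(r.latency_s / r.output_tokens for r in reqs) / len(reqs)


def tail_latency(reqs, pct=99.0):
    # Nearest-rank percentile of end-to-end latency; pct=99 is an assumption.
    values = sorted(r.latency_s for r in reqs)
    rank = max(1, math.ceil(pct / 100 * len(values)))
    return values[rank - 1]


def slo_attainment(reqs, slo_s):
    # Fraction of requests whose latency is within the target slo_s.
    return sum(r.latency_s <= slo_s for r in reqs) / len(reqs)


reqs = [Request(1.0, 10), Request(2.0, 20), Request(4.0, 10), Request(10.0, 50)]
print(mean_normalized_latency(reqs))
print(tail_latency(reqs))
print(slo_attainment(reqs, slo_s=4.0))
```

Under these assumptions, an improvement factor such as those reported above would be the ratio between a baseline system's value and SeaLLM's value for the same metric and workload.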

Cite this study

Zhao et al. studied this question.

www.synapsesocial.com/papers/68dd91c7fe798ba2fc498612 — DOI: https://doi.org/10.48550/arxiv.2504.15720

Also consider

Synapse has enriched 5 closely related papers on similar questions. Consider them for comparative context:

ScaleLLM: A Resource-Frugal LLM Serving Framework by Optimizing End-to-End Efficiency · 2024 · 1 citation
Optimizing Resource Allocation for Geographically-Distributed Inference by Large Language Models · 2026
BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models · 2024
SAGESERVE: Optimizing LLM Serving on Cloud Data Centers with Forecast Aware Auto-Scaling · 2025 · 1 citation

Authors

Yihao Zhao

Jiadun Chen

Peng Sun
