Serverless computing offers a compelling cloud model for online inference services. However, existing serverless platforms lack efficient support for GPUs, hindering their ability to deliver high-performance inference. In this paper, we present Torpor, a serverless platform for GPU-efficient, low-latency inference. To enable efficient sharing of a node’s GPUs among numerous inference functions, Torpor maintains models in main memory and dynamically swaps them onto GPUs upon request arrivals (i.e., late binding with model swapping). Torpor uses various techniques, including asynchronous API redirection, GPU runtime sharing, pipelined model execution, and efficient GPU memory management, to minimize latency overhead caused by model swapping. Additionally, we design an interference-aware request scheduling algorithm that utilizes high-speed GPU interconnects to meet latency service-level objectives (SLOs) for individual inference functions. We have implemented Torpor and evaluated its performance in a production environment. Utilizing late binding and model swapping, Torpor can concurrently serve hundreds of inference functions on a worker node with 4 GPUs, while achieving latency performance comparable to native execution, where each model is cached exclusively on a GPU. Pilot deployment in a leading commercial serverless cloud shows that Torpor reduces the GPU provisioning cost by 70% and 65% for users and the platform, respectively.
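The late-binding idea in the abstract (models stay in host memory and are swapped onto a GPU only when a request arrives) can be illustrated with a small toy simulation. This is a minimal sketch and not Torpor's implementation: the class `GpuModelCache`, the `serve` method, the model names, and the sizes are all hypothetical, and the least-recently-used eviction policy is an assumption made for illustration, since the abstract does not say how Torpor picks models to evict.

```python
from collections import OrderedDict


class GpuModelCache:
    """Toy model of one GPU's memory (hypothetical; not Torpor's API)."""

    def __init__(self, capacity_mb):
        self.capacity_mb = capacity_mb
        # Maps model name -> size in MB, ordered from least to most recently used.
        self.resident = OrderedDict()

    def used_mb(self):
        return sum(self.resident.values())

    def serve(self, name, host_models):
        """Bind a request to this GPU, swapping the model in on a miss.

        host_models maps every model kept in main memory to its size in MB.
        Returns the list of models evicted to make room.
        """
        size = host_models[name]
        if size > self.capacity_mb:
            raise ValueError(f"{name} does not fit on this GPU")
        if name in self.resident:
            # Hit: the model is already on the GPU, only refresh its recency.
            self.resident.move_to_end(name)
            return []
        evicted = []
        # Miss: evict least-recently-used models (assumed policy) until it fits.
        while self.used_mb() + size > self.capacity_mb:
            victim, _ = self.resident.popitem(last=False)
            evicted.append(victim)
        self.resident[name] = size  # stands in for the host-to-GPU copy
        return evicted


# Hypothetical models held in host memory, with sizes in MB.
host_models = {"resnet": 100, "bert": 400, "gpt2": 600}
gpu = GpuModelCache(capacity_mb=1000)
for request in ["resnet", "bert", "gpt2", "bert", "resnet"]:
    evicted = gpu.serve(request, host_models)
    print(request, "evicted:", evicted)
```

In this sketch the GPU holds only the models that recent requests needed, while every model remains available in host memory, which is what lets many functions share a few GPUs. The swap itself is reduced to a dictionary insert here; the techniques the abstract lists for hiding swap latency are not modeled.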
Minchen Yu
Ao Wang
Bohui Wu
ACM Transactions on Architecture and Code Optimization
Hong Kong University of Science and Technology
Chinese University of Hong Kong, Shenzhen
Alibaba Group (United States)
Yu et al. studied this question.
www.synapsesocial.com/papers/69df2bece4eeef8a2a6b0c96 — DOI: https://doi.org/10.1145/3800690