April 15, 2026 · Open Access

Enabling Low-Latency, GPU-Efficient Serverless Inference with Model Swapping

Key Points

The study aims to enable low-latency, GPU-efficient inference on serverless platforms.
The authors present Torpor, a serverless platform for GPU-efficient, low-latency inference.
Torpor keeps models in main memory and swaps them onto GPUs when requests arrive (late binding with model swapping).
Asynchronous API redirection, GPU runtime sharing, pipelined model execution, and efficient GPU memory management minimize the latency overhead of swapping.
An interference-aware request scheduling algorithm uses high-speed GPU interconnects to meet per-function latency SLOs.
Torpor achieves latency comparable to native execution, in which each model is cached exclusively on a GPU.
It can concurrently serve hundreds of inference functions on a worker node with 4 GPUs.
A pilot deployment in a commercial serverless cloud reduced GPU provisioning cost by 70% for users and 65% for the platform.
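
To make late binding with model swapping concrete, here is a minimal sketch of one GPU treated as a small cache in front of main memory. It is not Torpor's implementation: the class `GpuModelCache`, the least-recently-used eviction rule, the model names, and the sizes are all assumptions chosen for illustration, and no real GPU is touched.

```python
from collections import OrderedDict


class GpuModelCache:
    """Toy model of one GPU whose memory holds a few models at a time.

    All models are assumed to live in host memory already; "swapping in"
    here only updates bookkeeping, it does not touch a real GPU.
    """

    def __init__(self, capacity_mb):
        self.capacity_mb = capacity_mb
        self.resident = OrderedDict()  # model name -> size in MB, oldest first
        self.swap_ins = 0

    def used_mb(self):
        return sum(self.resident.values())

    def serve(self, model, size_mb):
        """Bind a request to this GPU at arrival time; swap in if needed."""
        if size_mb > self.capacity_mb:
            raise ValueError(f"{model} does not fit on this GPU")
        if model in self.resident:
            self.resident.move_to_end(model)  # mark as most recently used
            return "hit"
        # Assumed policy: evict least recently used models until it fits.
        while self.used_mb() + size_mb > self.capacity_mb:
            self.resident.popitem(last=False)
        self.resident[model] = size_mb
        self.swap_ins += 1
        return "swap"


# Hypothetical request trace: (model name, model size in MB).
gpu = GpuModelCache(capacity_mb=1000)
trace = [("resnet", 400), ("bert", 500), ("resnet", 400), ("gpt2", 600), ("bert", 500)]
outcomes = [gpu.serve(name, size) for name, size in trace]
print(outcomes)  # ['swap', 'swap', 'hit', 'swap', 'swap']
```

The point of the sketch is that a function is not pinned to a GPU ahead of time: the binding happens when a request arrives, so several functions can share one device at the cost of an occasional swap.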

Abstract

Serverless computing offers a compelling cloud model for online inference services. However, existing serverless platforms lack efficient support for GPUs, hindering their ability to deliver high-performance inference. In this paper, we present Torpor, a serverless platform for GPU-efficient, low-latency inference. To enable efficient sharing of a node’s GPUs among numerous inference functions, Torpor maintains models in main memory and dynamically swaps them onto GPUs upon request arrivals (i.e., late binding with model swapping). Torpor uses various techniques, including asynchronous API redirection, GPU runtime sharing, pipelined model execution, and efficient GPU memory management, to minimize latency overhead caused by model swapping. Additionally, we design an interference-aware request scheduling algorithm that utilizes high-speed GPU interconnects to meet latency service-level objectives (SLOs) for individual inference functions. We have implemented Torpor and evaluated its performance in a production environment. Utilizing late binding and model swapping, Torpor can concurrently serve hundreds of inference functions on a worker node with 4 GPUs, while achieving latency performance comparable to native execution, where each model is cached exclusively on a GPU. Pilot deployment in a leading commercial serverless cloud shows that Torpor reduces the GPU provisioning cost by 70% and 65% for users and the platform, respectively.
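
One way to see why pipelined model execution can hide swapping overhead is a simple two-stage timing model: a layer may start computing once its own weights have arrived and the previous layer has finished. The sketch below is an illustrative calculation under that assumption, not the paper's method; the function names and the per-layer timings are made up.

```python
def sequential_latency(transfer_ms, compute_ms):
    """Copy the whole model first, then run every layer."""
    return sum(transfer_ms) + sum(compute_ms)


def pipelined_latency(transfer_ms, compute_ms):
    """Run layer i as soon as it has arrived and layer i-1 has finished."""
    arrived = 0.0   # time at which the current layer's weights are on the GPU
    finished = 0.0  # time at which the previous layer finished computing
    for t, c in zip(transfer_ms, compute_ms):
        arrived += t  # transfers are assumed to run back to back
        finished = max(arrived, finished) + c
    return finished


# Hypothetical per-layer costs in milliseconds for a four-layer model.
transfer = [4.0, 4.0, 4.0, 4.0]
compute = [3.0, 3.0, 3.0, 3.0]
print(sequential_latency(transfer, compute))  # 28.0
print(pipelined_latency(transfer, compute))   # 19.0
```

With these numbers the transfer is the bottleneck, so the pipelined latency is the total transfer time plus the compute time of the last layer; all other compute is overlapped with copying.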

Authors

Minchen Yu

Ao Wang

Bohui Wu

Journals

ACM Transactions on Architecture and Code Optimization

Institutions

Hong Kong University of Science and Technology

Chinese University of Hong Kong, Shenzhen

Alibaba Group (United States)
