What question did this study set out to answer?

How to serve subgraph-based graph representation learning (SGRL) models efficiently in online applications, given their irregular request workloads.

April 10, 2026 | Open Access

SG-Serve: Efficient Model Serving for Subgraph-based Graph Representation Learning

Key Points

Developed the SG-Serve framework, which tailors subgraph extraction and model inference to irregular request workloads.
Proposed a general API for implementing diverse subgraph extraction methods on the CPU, exposing parallelization opportunities.
Designed a work-stealing policy for scheduling CPU threads across concurrent extraction requests.
Adopted workload-based batching for GPU model inference so that heavy requests do not monopolize the GPU.
Reduced 99th percentile (P99) latency by over 13x compared with existing systems.
Improved request throughput by over 33x compared with existing systems.
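
The work-stealing idea named in the key points can be illustrated with a small, deterministic simulation. This is a generic sketch of work stealing, not SG-Serve's actual scheduling policy: the function name `simulate`, the unit-cost tasks, and the "steal from the most loaded worker" rule are all assumptions made for illustration.

```python
from collections import deque
from typing import Deque, List


def simulate(queues: List[List[str]], steal: bool) -> int:
    """Return the number of time steps until all unit-cost tasks finish.

    Hypothetical model: each worker owns one deque of tasks and executes
    at most one task per time step.
    """
    deques: List[Deque[str]] = [deque(q) for q in queues]
    steps = 0
    while any(deques):
        for own in deques:
            if own:
                # Busy worker: take a task from the tail of its own deque.
                own.pop()
            elif steal:
                # Idle worker: steal from the head of the most loaded deque.
                victim = max(deques, key=len)
                if victim:
                    victim.popleft()
        steps += 1
    return steps


# One heavy request split into 8 chunks on worker 0, two regular
# requests on workers 1 and 2, and one initially idle worker.
queues = [[f"h{i}" for i in range(8)], ["r0"], ["r1"], []]
without_stealing = simulate(queues, steal=False)
with_stealing = simulate(queues, steal=True)
```

In this toy setting the heavy request keeps one worker busy for eight steps when no stealing is allowed, while with stealing the idle workers absorb its chunks and all ten tasks finish in three steps. The simulation does not model head-of-line blocking or real thread contention.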

Abstract

Subgraph-based graph representation learning (SGRL) is an emerging class of GNN models that achieve much higher accuracy for various tasks than classical GNN models (e.g., GCN and GAT). However, we observe that serving SGRL models for online applications is challenging due to their irregular request workloads. Specifically, some heavy requests need significantly more computation than regular requests (e.g., 100x), leading to excessively long tail latency in existing systems. As such, we build SG-Serve, which tailors subgraph extraction and model inference (i.e., the two main stages of SGRL models) to handle the irregular request workloads. For subgraph extraction on the CPU, we propose a general API to implement the diverse extraction methods of SGRL models. Besides generality, the API also exposes parallelization opportunities and allows the heavy requests to utilize multiple threads for speedup. To schedule the CPU threads to conduct extraction for concurrent requests, we design a work-stealing policy, which enjoys parallelism while avoiding head-of-line blocking. For model inference on the GPU, we batch the requests according to their workload instead of request count (as in existing systems) and run two GPU processes to prevent the heavy requests from monopolizing the GPU. Our experiments show that compared with existing systems, SG-Serve can reduce the 99th percentile (P99) latency by over 13x and improve the request throughput by over 33x.
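
To make the contrast between batching by workload and batching by request count concrete, here is a minimal sketch. It is not the SG-Serve implementation: the function names, the per-request workload estimates (for example, subgraph size), and the `budget` parameter are hypothetical, and the two-GPU-process design from the abstract is not modeled.

```python
from typing import List, Sequence


def batch_by_workload(workloads: Sequence[int], budget: int) -> List[List[int]]:
    """Group request indices in arrival order under a per-batch workload budget.

    A request whose workload alone exceeds the budget gets its own batch.
    """
    batches: List[List[int]] = []
    current: List[int] = []
    current_load = 0
    for idx, load in enumerate(workloads):
        # Close the current batch if adding this request would exceed the budget.
        if current and current_load + load > budget:
            batches.append(current)
            current, current_load = [], 0
        current.append(idx)
        current_load += load
    if current:
        batches.append(current)
    return batches


def batch_by_count(n_requests: int, batch_size: int) -> List[List[int]]:
    """Baseline: group request indices into fixed-size batches."""
    return [
        list(range(start, min(start + batch_size, n_requests)))
        for start in range(0, n_requests, batch_size)
    ]


# Hypothetical workload estimates; request 2 is a heavy request.
workloads = [10, 12, 1000, 8, 9, 11]
by_workload = batch_by_workload(workloads, budget=50)
by_count = batch_by_count(len(workloads), batch_size=3)
```

With these example numbers, count-based batching places the heavy request in the same batch as requests 0 and 1, so they wait for it to finish, whereas workload-based batching isolates it in a batch of its own.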

Authors

Qihui Zhou

Peiqi Yin

Xiaoli Yan

Journals

Proceedings of the ACM on Management of Data

Institutions

Chinese University of Hong Kong

Wuhan University