Subgraph-based graph representation learning (SGRL) is an emerging class of GNN models that achieve much higher accuracy on various tasks than classical GNN models (e.g., GCN and GAT). However, we observe that serving SGRL models for online applications is challenging because their request workloads are irregular. Specifically, some heavy requests need far more computation than regular requests (e.g., 100x more), leading to excessively long tail latency in existing systems. We therefore build SG-Serve, which tailors subgraph extraction and model inference (the two main stages of SGRL models) to handle irregular request workloads. For subgraph extraction on the CPU, we propose a general API for implementing the diverse extraction methods of SGRL models. Besides generality, the API also exposes parallelization opportunities and allows heavy requests to use multiple threads for speedup. To schedule the CPU threads that perform extraction for concurrent requests, we design a work-stealing policy, which exploits parallelism while avoiding head-of-line blocking. For model inference on the GPU, we batch requests according to their workload rather than by request count (as existing systems do) and run two GPU processes to prevent heavy requests from monopolizing the GPU. Our experiments show that, compared with existing systems, SG-Serve can reduce the 99th percentile (P99) latency by over 13x and improve request throughput by over 33x.
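The idea of batching by workload rather than by request count can be illustrated with a minimal sketch. This is not SG-Serve's implementation: the `Request` type, the `workload` estimate (for example, the size of the extracted subgraph), the `budget` and `heavy_threshold` parameters, and the light/heavy split used to feed two separate GPU processes are all assumptions made for illustration, since the abstract does not specify them.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Request:
    req_id: int
    # Hypothetical cost estimate, e.g. number of edges in the extracted subgraph.
    workload: int


def batch_by_workload(requests: List[Request], budget: int) -> List[List[Request]]:
    """Greedily group requests in arrival order so that each batch's total
    workload stays within `budget`. A request larger than the budget
    ends up in a batch of its own."""
    batches: List[List[Request]] = []
    current: List[Request] = []
    current_load = 0
    for req in requests:
        # Close the current batch if adding this request would exceed the budget.
        if current and current_load + req.workload > budget:
            batches.append(current)
            current, current_load = [], 0
        current.append(req)
        current_load += req.workload
    if current:
        batches.append(current)
    return batches


def split_by_weight(
    requests: List[Request], heavy_threshold: int
) -> Tuple[List[Request], List[Request]]:
    """Assumed routing rule: requests at or above the threshold go to a
    separate queue so they cannot delay light requests."""
    light = [r for r in requests if r.workload < heavy_threshold]
    heavy = [r for r in requests if r.workload >= heavy_threshold]
    return light, heavy


if __name__ == "__main__":
    workloads = [10, 20, 1000, 15, 30, 2000, 5]
    reqs = [Request(i, w) for i, w in enumerate(workloads)]
    light, heavy = split_by_weight(reqs, heavy_threshold=500)
    # Each queue would be consumed by its own GPU process in this sketch.
    for name, queue in (("light", light), ("heavy", heavy)):
        for batch in batch_by_workload(queue, budget=50):
            print(name, [r.workload for r in batch])
```

With a count-based policy, one heavy request in a batch would hold up every regular request batched with it; bounding the summed workload per batch, and isolating heavy requests, keeps the cost of each batch roughly uniform in this sketch.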
Qihui Zhou
Peiqi Yin
Xiaoli Yan
Proceedings of the ACM on Management of Data
Chinese University of Hong Kong
Wuhan University
Zhou et al. studied this question.
www.synapsesocial.com/papers/69d895206c1944d70ce06196 — DOI: https://doi.org/10.1145/3786697