What question did this study set out to answer?

The aim is to optimize GPU throughput under power constraints for deep learning inference services.

April 15, 2026 · Open Access

PctoDL: Adaptive GPU Throughput Optimization for Deep Learning Inference with Power Constraints

Key Points

The study aims to maximize GPU throughput under power constraints for deep learning inference services.
It presents PctoDL, which jointly optimizes resource partitioning, batch size, and SM/memory frequencies.
PctoDL uses a physics-informed iterative greedy partitioning algorithm.
A thermodynamic model-predictive controller regulates frequencies at runtime.
An online joint optimization mechanism adaptively refines the schedule.
On the RTX 3080 Ti, PctoDL improves average throughput over BatchDVFS by 108.41%.
The peak gain over BatchDVFS on the RTX 3080 Ti is 262.74%.
On the NVIDIA A100, the average gain over BatchDVFS is 19.74% and the maximum gain is 57.03%.
Compared with Morak's approach, PctoDL achieves a 79.05% average gain on the RTX 3080 Ti.
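
To read the reported percentages as speedup factors, the short sketch below converts each gain into a multiplier. It assumes that "gain" means relative improvement, (new - baseline) / baseline, which this page does not define explicitly. The function name and dictionary labels are illustrative only.

```python
def speedup_from_gain(gain_percent: float) -> float:
    """Convert a relative improvement in percent to a speedup factor.

    Assumes gain = (new - baseline) / baseline * 100, which is an
    interpretation and is not stated on this page.
    """
    return 1.0 + gain_percent / 100.0


# Figures quoted in the key points and abstract (gains over BatchDVFS).
reported_gains = {
    "RTX 3080 Ti, average": 108.41,
    "RTX 3080 Ti, peak": 262.74,
    "A100, average": 19.74,
    "A100, peak": 57.03,
}

for label, gain in reported_gains.items():
    print(f"{label}: +{gain:.2f}% is about {speedup_from_gain(gain):.2f}x")
```

Under that reading, an average gain above 100% means throughput more than doubled relative to the baseline.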

Abstract

The proliferation of deep learning inference services in power-constrained environments necessitates GPU management strategies that maximize throughput within strict power envelopes. Existing approaches often treat frequency scaling and resource partitioning as orthogonal problems or rely on static hardware assumptions, leading to suboptimal energy efficiency. This paper presents PctoDL, a power-aware scheduling system that maximizes aggregate inference throughput by jointly optimizing spatial resource partitioning, batch size, and SM/memory frequency settings. To address the throughput–power tradeoff in power-constrained multi-tenant inference, PctoDL couples resource partitioning with coordinated frequency control under a fixed power cap. It combines a physics-informed iterative greedy partitioning algorithm, a thermodynamic model-predictive controller for runtime frequency regulation, and an online joint optimization mechanism for adaptive refinement. On the NVIDIA RTX 3080 Ti platform, PctoDL improves average throughput over BatchDVFS by 108.41%, with a peak gain of 262.74%. On the NVIDIA A100 platform, it delivers an average gain of 19.74% and a maximum gain of 57.03%. Compared with Morak’s coarse-grained partitioning approach, PctoDL achieves average/peak gains of 79.05%/137.93% on the RTX 3080 Ti and 26.33%/70.21% on the A100.
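
As a rough illustration of what coupling partitioning with frequency selection under a power cap can look like, the toy sketch below greedily hands out GPU shares to tenants and sweeps a few frequency levels. It is not PctoDL's algorithm: the throughput model, the cubic power model, every constant, and the tenant names are hypothetical assumptions chosen only to make the example runnable.

```python
from dataclasses import dataclass


@dataclass
class Tenant:
    name: str               # hypothetical workload label
    peak_throughput: float  # assumed requests/s at saturation and full clock
    saturation: float       # assumed GPU share beyond which gains flatten


def throughput(t: Tenant, share: float, freq: float) -> float:
    # Toy model: linear in share up to saturation, scaled by relative clock.
    return t.peak_throughput * min(share, t.saturation) / t.saturation * freq


def power(shares, freq, idle_w=50.0, dyn_w=300.0):
    # Toy model: idle power plus dynamic power ~ active share * freq^3.
    return idle_w + dyn_w * sum(shares) * freq ** 3


def greedy_partition(tenants, freq, cap_w, units=10):
    """Give out one share unit at a time to the best marginal gain."""
    alloc = [0] * len(tenants)
    for _ in range(units):
        best, best_gain = None, 0.0
        for i, t in enumerate(tenants):
            trial = alloc.copy()
            trial[i] += 1
            if power([u / units for u in trial], freq) > cap_w:
                continue  # this step would exceed the power cap
            gain = (throughput(t, trial[i] / units, freq)
                    - throughput(t, alloc[i] / units, freq))
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:
            break  # no feasible step improves throughput
        alloc[best] += 1
    return alloc


def plan(tenants, cap_w, freqs=(0.6, 0.8, 1.0), units=10):
    """Sweep relative frequency levels and keep the best greedy partition."""
    best = None
    for f in freqs:
        alloc = greedy_partition(tenants, f, cap_w, units)
        total = sum(throughput(t, a / units, f) for t, a in zip(tenants, alloc))
        if best is None or total > best[0]:
            best = (total, f, alloc)
    return best


tenants = [Tenant("model_a", 1000.0, 0.5), Tenant("model_b", 400.0, 0.8)]
total, freq, alloc = plan(tenants, cap_w=200.0)
print(f"freq={freq}, allocation={alloc}, total throughput={total:.1f}")
```

Even in this toy setting, the best frequency depends on how much of the GPU is allocated, which is the kind of interaction that motivates optimizing the two jointly rather than separately.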

Cite this study

Hao et al. studied this question.

www.synapsesocial.com/papers/69df2b04e4eeef8a2a6aff57 — DOI: https://doi.org/10.1145/3805802

Authors

Meng Hao

Zikun Wu

Xu Tian

Journals

ACM Transactions on Architecture and Code Optimization

Institutions

Harbin Institute of Technology

DigitalSpace (United States)
