The proliferation of deep learning inference services in power-constrained environments necessitates GPU management strategies that maximize throughput within strict power envelopes. Existing approaches often treat frequency scaling and resource partitioning as orthogonal problems or rely on static hardware assumptions, leading to suboptimal energy efficiency. This paper presents PctoDL, a power-aware scheduling system that maximizes aggregate inference throughput by jointly optimizing spatial resource partitioning, batch size, and SM/memory frequency settings. To address the throughput–power tradeoff in power-constrained multi-tenant inference, PctoDL couples resource partitioning with coordinated frequency control under a fixed power cap. It combines a physics-informed iterative greedy partitioning algorithm, a thermodynamic model-predictive controller for runtime frequency regulation, and an online joint optimization mechanism for adaptive refinement. On the NVIDIA RTX 3080 Ti platform, PctoDL improves average throughput over BatchDVFS by 108.41%, with a peak gain of 262.74%. On the NVIDIA A100 platform, it delivers an average gain of 19.74% and a maximum gain of 57.03%. Compared with Morak’s coarse-grained partitioning approach, PctoDL achieves average/peak gains of 79.05%/137.93% on the RTX 3080 Ti and 26.33%/70.21% on the A100.
Hao et al. studied this question.
www.synapsesocial.com/papers/69df2b04e4eeef8a2a6aff57 — DOI: https://doi.org/10.1145/3805802
Meng Hao
Zikun Wu
Xu Tian
ACM Transactions on Architecture and Code Optimization
Harbin Institute of Technology
DigitalSpace (United States)