August 5, 2024 · Open Access

SLO-aware GPU Frequency Scaling for Energy Efficient LLM Inference Serving

Abstract

As Large Language Models (LLMs) gain traction, their reliance on power-hungry GPUs places ever-increasing energy demands, raising environmental and monetary concerns. Inference dominates LLM workloads, presenting a critical challenge for providers: minimizing energy costs under Service-Level Objectives (SLOs) that ensure optimal user experience. In this paper, we present throttLL'eM, a framework that reduces energy consumption while meeting SLOs through the use of instance and GPU frequency scaling. throttLL'eM features mechanisms that project future KV cache usage and batch size. Leveraging a Machine-Learning (ML) model that receives these projections as inputs, throttLL'eM manages performance at the iteration level to satisfy SLOs with reduced frequencies and instance sizes. We show that the proposed ML model achieves R² scores greater than 0.97 and mispredicts performance by less than 1 iteration per second on average. Experimental results on LLM inference traces show that throttLL'eM achieves up to 43.8% lower energy consumption and an energy efficiency improvement of at least 1.71× under SLOs, when compared to NVIDIA's Triton server.
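The control idea in the abstract (predict iteration-level performance from projected batch size and KV cache usage, then run at a reduced frequency that still meets the SLO) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the candidate frequency list, the units, and the toy predictor formula are all hypothetical, and a real system would substitute a trained model for `toy_predictor`.

```python
from typing import Callable, Optional, Sequence

# Hypothetical predictor signature (an assumption, not the paper's API):
# (frequency_mhz, batch_size, kv_cache_blocks) -> predicted iterations per second.
Predictor = Callable[[int, int, int], float]


def lowest_frequency_meeting_slo(
    frequencies_mhz: Sequence[int],
    predict_ips: Predictor,
    batch_size: int,
    kv_cache_blocks: int,
    required_ips: float,
) -> Optional[int]:
    """Return the lowest candidate frequency whose predicted throughput
    reaches required_ips, or None if no candidate does."""
    # Scan from the lowest frequency upward and stop at the first one that
    # satisfies the target, so the result is the slowest acceptable setting.
    for freq in sorted(frequencies_mhz):
        if predict_ips(freq, batch_size, kv_cache_blocks) >= required_ips:
            return freq
    return None


def toy_predictor(freq_mhz: int, batch_size: int, kv_cache_blocks: int) -> float:
    """Made-up stand-in for a learned performance model: throughput grows
    with frequency and shrinks as batch size and KV cache usage grow."""
    return freq_mhz / (40.0 + 0.5 * batch_size + 0.01 * kv_cache_blocks)


if __name__ == "__main__":
    candidates = [600, 900, 1200, 1500]  # illustrative values only
    choice = lowest_frequency_meeting_slo(
        candidates, toy_predictor, batch_size=16, kv_cache_blocks=1000,
        required_ips=18.0,
    )
    print(choice)
```

Returning `None` when no candidate satisfies the target leaves the fallback policy to the caller, for example running at the maximum frequency or scaling the instance size, which the abstract names as a second control knob.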

Cite this study

Kakolyris et al. (2024) studied this question.

www.synapsesocial.com/papers/68e5d786b6db64358756d927 — DOI: https://doi.org/10.48550/arxiv.2408.05235

Authors

Andreas Kosmas Kakolyris

Dimosthenis Masouros

Petros Vavaroutsos
