As Large Language Models (LLMs) gain traction, their reliance on power-hungry GPUs drives ever-increasing energy demands, raising environmental and monetary concerns. Inference dominates LLM workloads, presenting a critical challenge for providers: minimizing energy costs under Service-Level Objectives (SLOs) that ensure optimal user experience. In this paper, we present throttLL'eM, a framework that reduces energy consumption while meeting SLOs through instance scaling and GPU frequency scaling. throttLL'eM features mechanisms that project future KV cache usage and batch size. Leveraging a Machine-Learning (ML) model that receives these projections as inputs, throttLL'eM manages performance at the iteration level to satisfy SLOs with reduced frequencies and instance sizes. We show that the proposed ML model achieves R² scores greater than 0.97 and mispredicts performance by less than 1 iteration per second on average. Experimental results on LLM inference traces show that throttLL'eM achieves up to 43.8% lower energy consumption and an energy efficiency improvement of at least 1.71× under SLOs, when compared to NVIDIA's Triton server.
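The idea of meeting an SLO at a reduced GPU frequency can be illustrated with a minimal sketch. It is not the paper's implementation: `lowest_frequency_meeting_slo`, `toy_predictor`, the candidate frequencies, and the throughput formula are all hypothetical, and the toy predictor only stands in for a learned performance model that takes projected batch size and KV cache usage as inputs.

```python
from typing import Callable, Optional, Sequence


def lowest_frequency_meeting_slo(
    frequencies_mhz: Sequence[int],
    predict_ips: Callable[[int, int, float], float],
    batch_size: int,
    kv_cache_usage: float,
    required_ips: float,
) -> Optional[int]:
    """Return the lowest candidate frequency whose predicted throughput
    (iterations per second) meets the requirement, or None if none does."""
    for freq in sorted(frequencies_mhz):
        if predict_ips(freq, batch_size, kv_cache_usage) >= required_ips:
            return freq
    return None


def toy_predictor(freq_mhz: int, batch_size: int, kv_cache_usage: float) -> float:
    # Hypothetical stand-in for a learned model (an assumption, not the
    # paper's model): predicted throughput grows with frequency and
    # shrinks with batch size and KV cache usage (a fraction in [0, 1]).
    return freq_mhz / (40.0 * (1.0 + 0.02 * batch_size + kv_cache_usage))


# Illustrative candidate frequencies in MHz (assumed values).
candidates = [600, 900, 1200, 1410]
chosen = lowest_frequency_meeting_slo(
    candidates, toy_predictor, batch_size=8, kv_cache_usage=0.5, required_ips=15.0
)
print(chosen)
```

Scanning candidates in ascending order returns the first frequency that satisfies the throughput requirement. This matches the energy-saving intent only under the assumption that lower frequencies draw less power.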
Kakolyris et al. studied this question.
www.synapsesocial.com/papers/68e5d786b6db64358756d927 — DOI: https://doi.org/10.48550/arxiv.2408.05235
Andreas Kosmas Kakolyris
Dimosthenis Masouros
Petros Vavaroutsos
Synapse has enriched 5 closely related papers on similar clinical questions. Consider them for comparative context: