March 28, 2026

Characterizing Performance–Energy Trade-offs of Large Language Models in Multi-Request Workflows

Key Points

The aim is to systematically analyze performance-energy trade-offs in large language model inference for multi-request workflows.
Characterization of four representative workloads: sequential, interactive, agentic, and composite.
Use of an empirical NVIDIA A100 testbed with serving systems like vLLM and Parrot.
Analysis of key energy knobs including input-output length, batch size, and GPU power cap.
Batch size significantly affects performance but depends on the workload type.
GPU power capping results in modest energy savings.
Output length increases energy use linearly with limited efficiency benefits.

Abstract

Large language models (LLMs) are increasingly deployed in applications that form multi-request workflows, such as document summarization, search-based copilots, and multi-agent programming. While these workflows unlock richer functionality, they also amplify latency and energy demand during inference. Existing measurement and benchmarking efforts either focus on assessing the performance of LLM inference systems or consider single-request evaluations, overlooking the workflow dependencies and cross-request interactions unique to multi-request workflows. Moreover, the energy usage of such interdependent LLM calls has not been explored in depth. To address these gaps, this paper presents the first systematic characterization of performance–energy trade-offs in multi-request LLM inference. We develop and evaluate four representative workloads that capture sequential, interactive, agentic, and composite patterns common in modern deployments. Using an empirical NVIDIA A100 testbed with state-of-the-art serving systems (vLLM and Parrot), we systematically analyze how key energy knobs (e.g., input-output length, batch size, and GPU power cap) reshape latency, throughput, and component-level (e.g., CPU, GPU, and DRAM) energy use. Our findings reveal that batch size is the most impactful lever, though its benefits are highly workload dependent. While optimal batching benefits workloads with large shared prompts, it is ineffective for sequential summarization and only partially effective for multi-agent coding. GPU power capping provides modest but predictable savings, while output length induces linear energy scaling with limited efficiency gains. We further demonstrate that engine-level optimizations in vLLM (e.g., continuous batching, PagedAttention) maintain higher GPU utilization and efficiency, especially for decode-heavy workloads, while Parrot's workflow-aware scheduling achieves lower energy consumption under stringent power constraints. These findings offer actionable guidelines for developers and system operators in designing performance- and energy-aware LLM serving systems in emerging multi-request workflows.
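
As a rough illustration of the quantities discussed in the abstract (energy, throughput, and energy per generated token), the following minimal sketch integrates sampled power readings over time. It is not the authors' measurement pipeline: the function names, the `RunSummary` type, and the sample values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from dataclasses import dataclass
from typing import Sequence


def energy_joules(timestamps_s: Sequence[float], power_w: Sequence[float]) -> float:
    """Integrate sampled power (watts) over time (seconds) with the trapezoidal rule."""
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have equal length")
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


@dataclass
class RunSummary:
    energy_j: float           # total energy over the run
    throughput_tok_s: float   # output tokens per second
    joules_per_token: float   # energy cost of one output token


def summarize_run(timestamps_s: Sequence[float],
                  power_w: Sequence[float],
                  output_tokens: int) -> RunSummary:
    """Combine a power trace and a token count into simple efficiency metrics."""
    energy = energy_joules(timestamps_s, power_w)
    duration = timestamps_s[-1] - timestamps_s[0]
    return RunSummary(energy, output_tokens / duration, energy / output_tokens)


# Hypothetical trace (not measured data): a constant 200 W draw for 10 s
# while 500 output tokens are generated.
ts = [0.0, 5.0, 10.0]
pw = [200.0, 200.0, 200.0]
summary = summarize_run(ts, pw, 500)
print(summary)
```

Under these assumed inputs, a knob such as batch size or a power cap would show up as a change in the power trace, the run duration, or both, which is why energy per token and throughput are worth reporting together rather than separately.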

Authors

Md. Monzurul Amin Ifath

Israat Haque

Journals

Proceedings of the ACM on Measurement and Analysis of Computing Systems

Institutions

Dalhousie University
