August 6, 2025 | Open Access

Efficient Inference of Large Language Models through Model Compression

Key Points

Model compression techniques significantly enhance the inference efficiency of large language models.
The five major paradigms of model compression are pruning, quantization, knowledge distillation, low-rank approximation, and neural architecture search (NAS).
Evaluation metrics like latency and energy consumption are crucial for assessing compressed model performance.
Challenges such as performance degradation and ethical concerns persist with aggressive model compression.

Abstract

The increasing scale and complexity of large language models (LLMs) have revolutionized natural language processing (NLP), driving remarkable progress in a wide range of tasks such as machine translation, text summarization, sentiment analysis, and conversational AI. However, these performance gains have come at a substantial cost: modern LLMs often require billions of parameters, resulting in excessive computational demands, extensive memory footprints, and increased energy consumption. Such challenges significantly hinder the deployment of these models in resource-constrained environments, including mobile devices, edge platforms, and cost-sensitive cloud infrastructure. Consequently, model compression techniques have emerged as a vital solution for enhancing inference efficiency, reducing resource consumption, and accelerating model deployment without compromising task performance.
This survey provides a comprehensive and structured overview of model compression techniques specifically designed to improve the efficiency of large language models. We categorize existing approaches into five major paradigms: pruning, quantization, knowledge distillation, low-rank approximation, and neural architecture search (NAS). For each category, we examine the underlying theoretical principles, highlight state-of-the-art methods, and discuss their practical implementations. We further analyze the trade-offs between compression ratios, inference speed, and model accuracy, shedding light on the suitability of different approaches for various real-world scenarios.
In addition to covering core compression techniques, this survey explores evaluation metrics that extend beyond traditional accuracy measures. We discuss latency, throughput, memory efficiency, and energy consumption as crucial performance indicators for compressed models.
Furthermore, we examine deployment strategies that integrate compressed models into diverse environments, including cloud-based services, edge devices, and on-premises infrastructure. We outline best practices for efficient serving frameworks, scalable resource management, and hardware-aware optimization techniques to maximize the benefits of compressed models in production systems.
Despite significant advancements, the field of model compression faces persistent challenges. Aggressive compression can lead to catastrophic performance degradation, loss of generalization capabilities, and increased vulnerability to adversarial attacks. Furthermore, compressed models may unintentionally amplify biases present in the original models, posing ethical concerns in sensitive applications such as healthcare, finance, and legal decision-making. Additionally, deploying compressed models across heterogeneous hardware platforms presents new challenges in optimizing performance while maintaining compatibility.
To address these challenges, we identify promising research directions aimed at improving model compression techniques. We highlight the need for adaptive compression frameworks that dynamically adjust model complexity based on real-time conditions. We also emphasize the importance of energy-efficient compression methods that minimize the environmental impact of large-scale AI systems. Furthermore, we advocate for developing fairness-aware compression techniques that prioritize the retention of features crucial for minority groups and marginalized populations. Finally, we encourage the creation of standardized evaluation benchmarks that provide holistic assessments of compressed models’ accuracy, robustness, latency, and resource efficiency.
This survey aims to equip researchers and practitioners with a comprehensive understanding of model compression techniques, empowering them to design, evaluate, and deploy efficient language models that meet the growing demands of modern NLP applications. By addressing both the technical foundations and practical deployment considerations, this work provides a valuable resource for advancing the development of scalable, cost-effective, and accessible AI systems. Through continued innovation in compression methods, the NLP community can build robust, environmentally sustainable models that unlock the potential of language AI across diverse domains.
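
As a rough illustration of the trade-off between compression ratio and accuracy mentioned in the abstract, the sketch below applies two of the named paradigms, quantization and pruning, to a toy weight vector and reports the reconstruction error of each. It is not taken from the survey: the function names, the symmetric per-tensor int8 scheme, the 50% sparsity level, and the randomly generated weights are all assumptions chosen for brevity, and reconstruction error on weights is only a loose proxy for task accuracy.

```python
import random


def quantize_int8(weights):
    """Symmetric per-tensor quantization to signed 8-bit integers (assumed scheme)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Map int8 codes back to approximate float values."""
    return [q * scale for q in quantized]


def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest magnitude."""
    k = int(len(weights) * sparsity)
    if k == 0:
        return list(weights)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    dropped = set(order[:k])
    return [0.0 if i in dropped else w for i, w in enumerate(weights)]


def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Hypothetical stand-in for one layer's weights.
random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(1024)]

quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
pruned = magnitude_prune(weights, sparsity=0.5)

print("float32 -> int8 storage ratio:", 32 / 8)
print("quantization MSE:", mean_squared_error(weights, restored))
print("pruning MSE at 50% sparsity:", mean_squared_error(weights, pruned))
```

In practice, the effect of compression would be judged on task metrics together with latency, throughput, memory, and energy measurements, as the abstract describes, rather than on weight reconstruction error alone.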

Authors

J. Whitmore

C. Nicholas Hastings

Amir Patel