Low-rank adaptation (LoRA) is a predominant parameter-efficient finetuning method for adapting large language models (LLMs) to downstream tasks. Meanwhile, Compute-in-Memory (CIM) architectures demonstrate superior energy efficiency due to their array-level parallel in-memory computing designs. In this paper, we propose deploying LoRA-finetuned LLMs on a hybrid CIM architecture (i.e., pretrained weights onto energy-efficient Resistive Random-Access Memory (RRAM) and LoRA branches onto noise-free Static Random-Access Memory (SRAM)), reducing the energy cost to about 3% of that of the Nvidia A100 GPU. However, the inherent noise of RRAM perturbs the stored weights and degrades performance. To address this issue, we design a novel Hardware-aware Low-rank Adaptation (HaLoRA) method. The key insight is to train a LoRA branch that is robust to such noise and then deploy it on noise-free SRAM; the extra cost is negligible since LoRA parameters are far fewer than the pretrained weights (e.g., 0.15% for the LLaMA-3.2 1B model). To improve robustness to the noise, we theoretically analyze the gap between the optimization trajectories of the LoRA branch under ideal and noisy conditions, and further design an extra loss to minimize the upper bound of this gap. As a result, we enjoy both energy efficiency and accuracy during inference. Experiments finetuning the Qwen and LLaMA series demonstrate the effectiveness of HaLoRA across multiple reasoning tasks, achieving up to a 22.7-point improvement in average score while maintaining robustness across various noise types and noise levels.
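The hybrid deployment described above can be illustrated with a minimal PyTorch sketch (not the authors' implementation): the frozen pretrained weight, mapped to RRAM, is perturbed by a simulated noise model, while the trainable LoRA branch, mapped to SRAM, remains noise-free. The Gaussian noise model and all hyperparameters below are illustrative assumptions.

```python
import torch

class NoisyLoRALinear(torch.nn.Module):
    """Linear layer modeling hybrid CIM deployment: noisy frozen weight
    (RRAM) plus a noise-free trainable low-rank branch (SRAM)."""

    def __init__(self, in_features, out_features, rank=8, noise_std=0.02):
        super().__init__()
        # Frozen pretrained weight, mapped to RRAM in the hybrid design.
        self.weight = torch.nn.Parameter(
            torch.randn(out_features, in_features), requires_grad=False)
        # Trainable low-rank factors, mapped to noise-free SRAM.
        self.A = torch.nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.B = torch.nn.Parameter(torch.zeros(out_features, rank))
        # Assumed additive Gaussian noise model for RRAM (illustrative).
        self.noise_std = noise_std

    def forward(self, x):
        # Noise perturbs only the RRAM-stored pretrained weight.
        noisy_w = self.weight + torch.randn_like(self.weight) * self.noise_std
        # LoRA branch x @ A^T @ B^T is computed noise-free.
        return x @ noisy_w.T + x @ self.A.T @ self.B.T

layer = NoisyLoRALinear(16, 16)
y = layer(torch.randn(4, 16))
print(y.shape)  # torch.Size([4, 16])
```

Training such a layer under injected noise is one way to probe the robustness gap the paper analyzes; HaLoRA's specific loss term on the optimization-trajectory gap is beyond this sketch.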
Taiqiang Wu
Chenchen Ding
Wei Zhou
ACM Transactions on Design Automation of Electronic Systems
University of Hong Kong
Tsinghua University
DOI: https://doi.org/10.1145/3801559