What question did this study set out to answer?

The study aims to improve the efficiency and speed of large language models through an innovative in-memory computing architecture.

January 23, 2026 (Open Access)

Analog in-memory computing attention mechanism for fast and energy-efficient large language models

Key Points

Developed a custom self-attention mechanism using gain cells for in-memory computing.
Designed an initialization algorithm to adapt existing models to the new architecture.
Compared performance of the new architecture against traditional GPU-based systems.
Reduced attention latency by up to two orders of magnitude compared with GPUs.
Reduced attention energy consumption by up to four orders of magnitude compared with GPUs.
Achieved text-processing performance comparable to GPT-2 without training from scratch.

Abstract

Transformer networks, driven by self-attention, are central to large language models. In generative transformers, self-attention uses cache memory to store token projections, avoiding recomputation at each time step. However, graphics processing unit (GPU)-stored projections must be loaded into static random-access memory for each new generation step, causing latency and energy bottlenecks. Here we present a custom self-attention in-memory computing architecture based on emerging charge-based memories called gain cells, which can be efficiently written to store new tokens during sequence generation and enable parallel analog dot-product computation required for self-attention. However, the analog gain-cell circuits introduce non-idealities and constraints preventing the direct mapping of pre-trained models. To circumvent this problem, we design an initialization algorithm achieving text-processing performance comparable to GPT-2 without training from scratch. Our architecture reduces attention latency and energy consumption by up to two and four orders of magnitude, respectively, compared with GPUs, marking a substantial step toward ultrafast, low-power generative transformers.
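
To make the cached self-attention step described in the abstract concrete, the following is a minimal software sketch in plain Python. It is not the authors' analog gain-cell circuit or their code: it shows standard scaled dot-product attention over a key/value cache, and the names `KVCache`, `attend`, and `generation_step` are hypothetical, chosen only for illustration. The point it illustrates is that every generation step writes one new token projection and then computes dot products against all stored keys, which is the memory traffic the in-memory architecture aims to avoid.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class KVCache:
    """Holds the key and value projections of every token seen so far
    (hypothetical helper, not part of the paper)."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        # One write per generated token; earlier projections are reused.
        self.keys.append(key)
        self.values.append(value)


def attend(query, cache):
    """Scaled dot-product attention of one query over all cached tokens."""
    d = len(query)
    # One dot product per cached key: the operation the abstract says
    # gain cells compute in parallel in the analog domain.
    scores = [dot(query, k) / math.sqrt(d) for k in cache.keys]
    weights = softmax(scores)
    dim_v = len(cache.values[0])
    return [
        sum(w * v[i] for w, v in zip(weights, cache.values))
        for i in range(dim_v)
    ]


def generation_step(query, key, value, cache):
    """Store the new token's projections, then attend over the whole cache."""
    cache.append(key, value)
    return attend(query, cache)
```

Calling `generation_step` once per generated token grows the cache by one entry each time, so the number of dot products per step scales with the sequence length. In this digital sketch each of those products requires reading the stored keys back from memory.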

Authors

Nathan Leroux

Paul Manea

Chirag Sudarshan

Institutions

Forschungszentrum Jülich
