Transformer networks, driven by self-attention, are central to large language models. In generative transformers, self-attention uses cache memory to store token projections, avoiding recomputation at each time step. However, graphics processing unit (GPU)-stored projections must be loaded into static random-access memory for each new generation step, causing latency and energy bottlenecks. Here we present a custom self-attention in-memory computing architecture based on emerging charge-based memories called gain cells, which can be efficiently written to store new tokens during sequence generation and enable parallel analog dot-product computation required for self-attention. However, the analog gain-cell circuits introduce non-idealities and constraints preventing the direct mapping of pre-trained models. To circumvent this problem, we design an initialization algorithm achieving text-processing performance comparable to GPT-2 without training from scratch. Our architecture reduces attention latency and energy consumption by up to two and four orders of magnitude, respectively, compared with GPUs, marking a substantial step toward ultrafast, low-power generative transformers.
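The caching idea mentioned in the abstract can be illustrated with a minimal sketch of conventional, digital key-value caching for a single attention head. This is not the gain-cell analog architecture described above; it only shows why storing token projections avoids recomputation at each generation step. All names (`KVCacheAttention`, `attention_without_cache`, the weight matrices `Wq`, `Wk`, `Wv`) are hypothetical and chosen for illustration, and standard scaled dot-product attention is assumed.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def matvec(W, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [dot(row, x) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    """Scaled dot-product attention of one query over stored keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([dot(q, k) * scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


class KVCacheAttention:
    """Single-head attention that keeps key/value projections in a cache.

    Illustrative only: the names and the plain-Python storage are
    assumptions, not the architecture from the paper.
    """

    def __init__(self, Wq, Wk, Wv):
        self.Wq, self.Wk, self.Wv = Wq, Wk, Wv
        self.keys = []    # one cached key projection per past token
        self.values = []  # one cached value projection per past token

    def step(self, x):
        # Only the newest token is projected; older projections are reused.
        q = matvec(self.Wq, x)
        self.keys.append(matvec(self.Wk, x))
        self.values.append(matvec(self.Wv, x))
        return attend(q, self.keys, self.values)


def attention_without_cache(Wq, Wk, Wv, tokens):
    """Reference version: re-project every token at each generation step."""
    q = matvec(Wq, tokens[-1])
    keys = [matvec(Wk, t) for t in tokens]
    values = [matvec(Wv, t) for t in tokens]
    return attend(q, keys, values)
```

Under these assumptions, calling `step` once per generated token gives the same output as `attention_without_cache` on the full prefix, while projecting only one token per step. The cache still has to be read in full at every step, which is the memory-transfer cost the abstract identifies as the bottleneck on GPUs.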
Nathan Leroux
Paul Manea
Chirag Sudarshan
Forschungszentrum Jülich
Leroux et al. (Wed,) studied this question.
www.synapsesocial.com/papers/69730ef2c8125b09b0d1ed33 — DOI: https://doi.org/10.34734/fzj-2026-00225