Transformer networks, driven by self-attention, are central to large language models. In generative transformers, self-attention uses cache memory to store token projections, avoiding recomputation at each time step. However, graphics processing unit (GPU)-stored projections must be loaded into static random-access memory for each new generation step, causing latency and energy bottlenecks. Here we present a custom self-attention in-memory computing architecture based on emerging charge-based memories called gain cells, which can be efficiently written to store new tokens during sequence generation and enable parallel analog dot-product computation required for self-attention. However, the analog gain-cell circuits introduce non-idealities and constraints preventing the direct mapping of pre-trained models. To circumvent this problem, we design an initialization algorithm achieving text-processing performance comparable to GPT-2 without training from scratch. Our architecture reduces attention latency and energy consumption by up to two and four orders of magnitude, respectively, compared with GPUs, marking a substantial step toward ultrafast, low-power generative transformers.
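The caching idea mentioned in the abstract can be illustrated with a minimal, purely digital sketch of one attention head during generation. This is not the gain-cell hardware or the authors' initialization algorithm; it only shows, under simplifying assumptions (a single head, plain Python lists, no non-idealities), why storing key and value projections avoids recomputing them at each step. All names here (`KVCache`, `attend_step`, `attend_full`) are hypothetical and chosen for illustration.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [dot(row, x) for row in W]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class KVCache:
    """Hypothetical cache holding key/value projections of past tokens."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)


def attend(q, keys, values):
    """Scaled dot-product attention of one query over stored keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([dot(q, k) * scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def attend_step(x, Wq, Wk, Wv, cache):
    """One generation step: project only the new token, reuse the cache."""
    q = matvec(Wq, x)
    cache.append(matvec(Wk, x), matvec(Wv, x))  # one write per new token
    return attend(q, cache.keys, cache.values)


def attend_full(xs, Wq, Wk, Wv):
    """Reference without a cache: re-project every token for the last query."""
    q = matvec(Wq, xs[-1])
    keys = [matvec(Wk, x) for x in xs]
    values = [matvec(Wv, x) for x in xs]
    return attend(q, keys, values)
```

With the cache, each step performs one key and one value projection plus a set of dot products against the stored keys; without it, every earlier token would be projected again. The dot products against stored keys are the operation that the abstract describes as being computed in parallel inside the memory.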
Leroux et al. (Wed,) studied this question.
www.synapsesocial.com/papers/69730ef2c8125b09b0d1ed33 — DOI: https://doi.org/10.34734/fzj-2026-00225
Nathan Leroux
Paul Manea
Chirag Sudarshan
Forschungszentrum Jülich