What question did this study set out to answer?

The aim is to demonstrate a persistent writable memory mechanism within a transformer model architecture.

April 16, 2026 | Open Access

BDH-Memory: Persistent, Writable Knowledge Storage in a Transformer Trained on a Laptop

Key Points

Developed a small transformer (approximately 30M parameters) based on the BDH architecture
Trained it from scratch on 250M tokens of FineWeb-Edu
Implemented content-based addressing and meta-test-time-training
Used a few hundred gradient steps at inference time to write to a memory parameter slot that is structurally separate from the backbone
The model holds 20 distinct fabricated facts simultaneously and retrieves all 20 with median p>0.99
Two independent fictional documents (212 and 205 tokens) injected sequentially coexist with 71% retention
The entire training run fits on a single consumer GPU

Abstract

Retrieval-augmented generation and vector embedding systems treat memory as something that lives outside the model, in a separate database that gets stitched into the prompt at inference time. This paper takes a different approach.

We show that a small transformer based on the BDH architecture has a parameter slot, structurally separate from the backbone, that can be written to at inference time using a few hundred gradient steps, saved to disk, reloaded in a fresh process, and queried. None of this requires putting the original document back into the context window.

We demonstrate this on a model with approximately 30M parameters trained from scratch on 250M tokens of FineWeb-Edu, with content-based addressing and meta-test-time-training. After training, the model can hold 20 distinct fabricated facts simultaneously in its slow memory parameter block, retrieve all 20 with median p>0.99, and survive cold reload. Two independent fictional documents (212 and 205 tokens) injected sequentially coexist with 71% retention. The mechanism is reproducible, well-characterised, and the entire training run fits on a single consumer GPU.

The paper documents how the architecture got there: the spontaneous gate specialisation that turned BDH into a two-system memory hierarchy resembling complementary learning systems theory; the gradient routing phenomenon where adding a Hebbian write site at one layer doubles the gradient investment that backpropagation makes in the slow weights at that same layer; the discovery that joint contrastive encoding is the correct write protocol; and several negative results that took us in the right direction by ruling out the wrong ones. We position the work relative to concurrent test-time training research and explain what we have shown versus what we have not. This work was done over a few weekends and a couple of weeks of late nights on a laptop GPU.
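The write, save, reload, and query cycle described in the abstract can be illustrated with a toy that is much simpler than the paper's model. The sketch below is not BDH and not the author's write protocol: it assumes a frozen random projection standing in for the backbone, a single linear memory matrix standing in for the separate parameter slot, and plain squared-error gradient steps instead of joint contrastive encoding. Every name in it (`make_backbone`, `encode`, `write`, `query`, `DIM`) is hypothetical and chosen only for illustration.

```python
import json
import os
import random
import tempfile

DIM = 16  # toy vector width (assumption; unrelated to the paper's model size)

def make_backbone(seed=0):
    """Frozen stand-in for the backbone: a fixed random projection."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) / DIM ** 0.5 for _ in range(DIM)]
            for _ in range(DIM)]

def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def encode(backbone, key):
    """Content-based address: unit-norm backbone projection of the key."""
    hidden = matvec(backbone, key)
    norm = sum(x * x for x in hidden) ** 0.5
    return [x / norm for x in hidden]

def write(memory, backbone, facts, sweeps=300, lr=0.5):
    """Gradient steps on squared error; only `memory` is updated in place."""
    encoded = [(encode(backbone, key), value) for key, value in facts]
    for _ in range(sweeps):
        for hidden, value in encoded:
            pred = matvec(memory, hidden)
            err = [p - v for p, v in zip(pred, value)]
            # Gradient of 0.5 * ||memory @ hidden - value||^2 is err * hidden^T.
            for i in range(DIM):
                for j in range(DIM):
                    memory[i][j] -= lr * err[i] * hidden[j]

def query(memory, backbone, key):
    """Return the index of the strongest answer unit for this key."""
    scores = matvec(memory, encode(backbone, key))
    return max(range(DIM), key=scores.__getitem__)

# Four fabricated facts: random key -> one-hot answer index.
rng = random.Random(1)
facts = []
for answer in range(4):
    key = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
    value = [1.0 if i == answer else 0.0 for i in range(DIM)]
    facts.append((key, value))

backbone = make_backbone()
memory = [[0.0] * DIM for _ in range(DIM)]
write(memory, backbone, facts)

# Save only the memory slot, then reload it next to a freshly built backbone.
path = os.path.join(tempfile.mkdtemp(), "memory.json")
with open(path, "w") as handle:
    json.dump(memory, handle)
with open(path) as handle:
    reloaded = json.load(handle)

recalled = [query(reloaded, make_backbone(), key) for key, _ in facts]
print(recalled)
```

Because only `memory` is serialized, the file on disk holds the written facts and nothing else; the backbone is rebuilt from its seed and never changes. The toy says nothing about capacity or interference in the real model, where those questions are the subject of the paper's experiments.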

Cite this study

Russell THOMAS studied this question.

www.synapsesocial.com/papers/69e07dad2f7e8953b7cbea74 — DOI: https://doi.org/10.5281/zenodo.19565707