Retrieval-augmented generation and vector embedding systems treat memory as something that lives outside the model, in a separate database that gets stitched into the prompt at inference time. This paper takes a different approach.

We show that a small transformer based on the BDH architecture has a parameter slot, structurally separate from the backbone, that can be written to at inference time using a few hundred gradient steps, saved to disk, reloaded in a fresh process, and queried. None of this requires putting the original document back into the context window.

We demonstrate this on a model with approximately 30M parameters trained from scratch on 250M tokens of FineWeb-Edu, with content-based addressing and meta-test-time-training. After training, the model can hold 20 distinct fabricated facts simultaneously in its slow memory parameter block, retrieve all 20 with median p > 0.99, and retain them across a cold reload. Two independent fictional documents (212 and 205 tokens) injected sequentially coexist with 71% retention. The mechanism is reproducible and well-characterised, and the entire training run fits on a single consumer GPU.

The paper documents how the architecture got there: the spontaneous gate specialisation that turned BDH into a two-system memory hierarchy resembling complementary learning systems theory; the gradient routing phenomenon where adding a Hebbian write site at one layer doubles the gradient investment that backpropagation makes in the slow weights at that same layer; the discovery that joint contrastive encoding is the correct write protocol; and several negative results that pointed us in the right direction by ruling out the wrong ones. We position the work relative to concurrent test-time training research and explain what we have shown versus what we have not. This work was done over a few weekends and a couple of weeks of late nights on a laptop GPU.
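The write, save, reload, and query cycle described in the abstract can be illustrated with a deliberately tiny stand-in. The sketch below is not the BDH architecture or the paper's write protocol: `ToyModel`, its sizes, the one-hot keys, the plain cross-entropy objective, and the JSON file format are all assumptions made for illustration. It only shows the general shape of the idea: a frozen backbone, a separate parameter block updated by a few hundred gradient steps, and recall from a fresh object after loading that block from disk, with no document placed back in the input.

```python
import json
import math
import os
import random
import tempfile

DIM, VOCAB, N_FACTS = 16, 8, 5  # toy sizes, chosen only for illustration


def one_hot(i, n):
    return [1.0 if j == i else 0.0 for j in range(n)]


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


class ToyModel:
    """Hypothetical stand-in: frozen backbone plus a separate writable slot."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        # "Pretrained" backbone weights; never updated after construction.
        self.backbone = [[rng.gauss(0, 0.1) for _ in range(DIM)]
                         for _ in range(VOCAB)]
        # Memory slot, structurally separate from the backbone; starts empty.
        self.memory = [[0.0] * DIM for _ in range(VOCAB)]

    def probs(self, key):
        logits = [
            sum((b + m) * x for b, m, x in zip(b_row, m_row, key))
            for b_row, m_row in zip(self.backbone, self.memory)
        ]
        return softmax(logits)

    def write(self, facts, steps=300, lr=0.5):
        """Gradient descent on cross-entropy, updating the memory slot only."""
        for _ in range(steps):
            for key, answer in facts:
                p = self.probs(key)
                for k, row in enumerate(self.memory):
                    grad_logit = p[k] - (1.0 if k == answer else 0.0)
                    for j, x in enumerate(key):
                        row[j] -= lr * grad_logit * x


# Each fabricated "fact" maps a cue (a one-hot key) to an answer id.
facts = [(one_hot(i, DIM), (3 * i + 1) % VOCAB) for i in range(N_FACTS)]

model = ToyModel(seed=0)
before = [model.probs(k)[a] for k, a in facts]
model.write(facts)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "memory_slot.json")
    with open(path, "w") as f:
        json.dump(model.memory, f)   # persist only the slot, not the backbone

    reloaded = ToyModel(seed=0)      # same backbone checkpoint, empty slot
    with open(path) as f:
        reloaded.memory = json.load(f)

recalled = [reloaded.probs(k)[a] for k, a in facts]
print("before write:", [round(p, 3) for p in before])
print("after reload:", [round(p, 3) for p in recalled])
```

The one-hot keys are orthogonal, so the writes cannot interfere with one another. That makes this toy much easier than the setting the abstract describes, where content-based addressing and sequentially injected documents have to coexist in the same block.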
Russell THOMAS (Mon,) studied this question.
www.synapsesocial.com/papers/69e07dad2f7e8953b7cbea74 — DOI: https://doi.org/10.5281/zenodo.19565707
Russell THOMAS