TemporalMesh Transformer (TMT) is a novel autoregressive language model architecture that simultaneously resolves three fundamental inefficiencies shared by every standard transformer design since Vaswani et al. (2017). Despite nearly a decade of scaling and refinement, the vanilla transformer still makes three assumptions that have remained largely unchanged: every token attends to every other token regardless of relevance; the attention graph is flat and fully connected, with no structural awareness; and every token, whether a comma or a rare technical term, spends identical compute traversing all N layers. TMT breaks all three of these assumptions at once, in a single unified forward pass, through five tightly coupled architectural innovations.

INNOVATION 1: MESH ATTENTION (DYNAMIC GRAPH TOPOLOGY)

Standard transformers compute full O(S²) attention over all token pairs. TMT replaces this with a dynamic sparse graph. At the start of every layer, a MeshBuilder module computes pairwise cosine similarity between the current token representations and retains only the top-k nearest neighbours per token (k=8 by default), forming a sparse edge index. Attention then flows exclusively along these edges, reducing the per-layer attention cost from O(S²·d) to O(S·k·d), a 128× reduction (S/k) at sequence length S=1024. Crucially, this graph is not fixed or pre-defined. After each TMT layer the token representations change, so the graph is rebuilt from scratch. Topology is a live, emergent property of the current forward pass, adapting to what the tokens semantically mean right now. No prior Graph Transformer does this: all existing graph-aware architectures use a static topology defined by the input data, not by the model's own intermediate representations.

INNOVATION 2: TEMPORAL DECAY ENCODING

Existing positional encodings (sinusoidal, learned absolute, RoPE, ALiBi) tell a token where it is in the sequence.
None of them tell the model how relevant a distant token is to the current prediction target. TMT introduces Temporal Decay Encoding: a decay factor, controlled by a learned per-head scalar, that is multiplied directly into the post-softmax attention weights, attenuating tokens that are semantically far from the current query. The decay function is sigmoid(W_decay · |tᵢ − tⱼ|), where t ∈ [0, 1] is the normalised position and W_decay is a learned scalar per attention head. Unlike ALiBi (which adds a fixed linear bias to logits before softmax), TMT decay is multiplicative, applied after softmax normalisation, fully learned end-to-end, and jointly optimised with all other model parameters. The result is that recent, semantically relevant tokens stay loud while distant, irrelevant tokens fade, without any recurrence, hidden state, or fixed schedule.

INNOVATION 3: ADAPTIVE DEPTH ROUTING (PER-TOKEN EARLY EXIT)

In every existing transformer (GPT, LLaMA, Mistral, and all their derivatives) a comma and a rare scientific term spend exactly the same compute: all N layers, unconditionally. TMT introduces an ExitGate after every layer that assigns each token a scalar confidence score via a single linear projection followed by a sigmoid. If confidence exceeds a threshold (τ=0.85), the token's representation is frozen and it bypasses all remaining layers, contributing its final representation unchanged to the output. If confidence is below the threshold, the token continues to the next layer. Training applies an auxiliary entropy minimisation loss (weighted 0.1) that pushes gate outputs toward 0 or 1, teaching the model to be decisive. The result: simple tokens such as punctuation and determiners exit by layer 2–3, common nouns exit around layer 6, technical terms exit around layer 9, and rare words use all 12 layers. Average per-token compute across a typical WikiText-2 batch drops to approximately 48% of the full-depth baseline, with no accuracy loss on the tokens that need deep processing.
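The exit-gate routing described above can be illustrated with a minimal pure-Python sketch. It is not the released implementation: the function names (`gate_confidence`, `route_tokens`, `gate_entropy`) are hypothetical, each layer is reduced to a per-token callable (a real TMT layer mixes tokens through attention), and how frozen tokens are treated as attention keys is not modelled. Only the threshold τ=0.85, the linear-plus-sigmoid gate, and the entropy penalty come from the text.

```python
import math

TAU = 0.85  # exit threshold quoted in the text


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gate_confidence(h, w, b):
    """Single linear projection followed by a sigmoid: one scalar per token."""
    return sigmoid(sum(hi * wi for hi, wi in zip(h, w)) + b)


def route_tokens(hidden, layers, gates, tau=TAU):
    """Push tokens through the layers, freezing each token once its gate fires.

    hidden: list of token vectors (lists of floats)
    layers: list of callables vector -> vector (stand-ins for TMT layers)
    gates:  list of (weights, bias) pairs, one gate per layer
    Returns the final vectors and the number of layers each token used.
    """
    hidden = [list(h) for h in hidden]
    active = [True] * len(hidden)
    depth = [0] * len(hidden)
    for layer, (w, b) in zip(layers, gates):
        for i in range(len(hidden)):
            if not active[i]:
                continue  # frozen token: representation passes through unchanged
            hidden[i] = layer(hidden[i])
            depth[i] += 1
            if gate_confidence(hidden[i], w, b) > tau:
                active[i] = False  # confident enough: skip all remaining layers
    return hidden, depth


def gate_entropy(p, eps=1e-9):
    """Binary entropy of a gate output; small when p is close to 0 or 1."""
    return -(p * math.log(p + eps) + (1.0 - p) * math.log(1.0 - p + eps))
```

With three toy layers that add 1.0 to every coordinate and a gate `([1.0, 0.0], -2.0)`, a token starting at `[3.0, 0.0]` exits after one layer while a token starting at `[0.0, 0.0]` uses all three. In training, an auxiliary term such as `0.1 * gate_entropy(p)` averaged over tokens would correspond to the entropy minimisation loss mentioned in the text; the exact form of that loss is an assumption here.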
This is the first per-token adaptive depth mechanism demonstrated for autoregressive generation (prior work was restricted to classification).

INNOVATION 4: DUAL-STREAM FEED-FORWARD NETWORK

The standard transformer FFN applies a single two-layer MLP to each token. TMT replaces this with two parallel streams: a syntax stream and a semantic stream, each with hidden dimension 256 (for a total width matching d_model=512). The syntax stream captures structural and grammatical patterns; the semantic stream captures meaning and topic representations. A lightweight learned sigmoid gate fuses the two streams per token, allowing the model to dynamically weight syntactic versus semantic processing depending on what each token requires. This decomposition consistently reduces validation perplexity even when the other three innovations are ablated out, and it introduces no additional parameters beyond the two stream projection matrices.

INNOVATION 5: EMA MEMORY ANCHORS

Each TMT layer contains 16 persistent key-value parameter vectors, called memory anchors, that are updated during training by an exponential moving average (β=0.99) of the tokens that attend to them. Every token cross-attends to all 16 anchors within each layer, allowing the model to retrieve slowly accumulated global context that is not present in the local sequence window. This provides a form of fast-weight storage (analogous to Ba et al., 2016) without any recurrence or external memory read/write infrastructure. The anchors accumulate rare patterns and domain-level statistics across training steps and remain fixed at inference, acting as a compressed long-term context.

EXPERIMENTAL RESULTS

All experiments use WikiText-2 with the GPT-2 tokenizer, AdamW optimisation, a cosine LR schedule with warmup, and ~120M parameters per model for fair comparison. Full TMT (all five innovations active) achieves a validation perplexity of 29.4, compared to 42.1 for a parameter-matched vanilla transformer, a 30.2% reduction in perplexity. Simultaneously, average per-token compute drops to ~48% of baseline. A full factorial ablation across all eight combinations of the three core innovations (mesh, decay, exit gate) reveals superadditive gains: the improvement from the full combination (12.7 PPL points) substantially exceeds the sum of the individual improvements (4.3 + 1.8 + 2.5 = 8.6 PPL points), confirming positive architectural interactions. Attention complexity analysis shows 128× fewer operations at S=1024 and 256× at S=2048 compared to standard attention. Exit gate analysis shows token compute is precisely stratified by linguistic complexity: punctuation averages 2.1 layers, articles 3.4, common nouns 5.8, technical terms 9.3, and rare words 11.7 out of 12.

RELEASE CONTENTS

This deposit contains the full 20-page publication-quality paper in PDF format, including: abstract, introduction, related work covering six prior architectures, complete mathematical specification with 18 numbered equations, full architecture diagram, seven research figures (architecture overview, dynamic graph evolution, temporal decay analysis, exit gate distribution, training curves, ablation Pareto frontier, computational complexity), main results table, full ablation table, discussion of limitations and future work, conclusion, 15 literature references, and two appendices covering configuration reference and output field documentation.

Source code, ablation experiment notebooks, test suites, and a 2,231-row benchmark dataset are available at the linked GitHub and Hugging Face repositories.

GitHub repository: https://github.com/vignesh2027/TemporalMesh-Transformer
Hugging Face model: https://huggingface.co/vigneshwar234/TemporalMesh-Transformer
Benchmark dataset: https://huggingface.co/datasets/vigneshwar234/TMT-Benchmarks
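For readers who want a concrete picture of Innovations 1 and 2, the following is a minimal pure-Python sketch of top-k mesh construction followed by edge-restricted attention with post-softmax decay. It is a sketch under stated assumptions, not the code in the linked repositories: the names `build_mesh`, `mesh_attention`, and `dense_to_mesh_ratio` are hypothetical, queries, keys, and values are the raw token vectors (no learned projections), there is a single head, neighbours are restricted to positions j ≤ i as an assumed causal constraint, and the decayed weights are not renormalised because the text does not say whether TMT does so.

```python
import math


def cosine(u, v):
    """Cosine similarity; zero vectors are treated as having unit norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u)) or 1.0
    nv = math.sqrt(sum(b * b for b in v)) or 1.0
    return dot / (nu * nv)


def build_mesh(x, k):
    """For each token i, keep the k most cosine-similar tokens among j <= i."""
    edges = []
    for i, xi in enumerate(x):
        ranked = sorted(range(i + 1), key=lambda j: cosine(xi, x[j]), reverse=True)
        edges.append(ranked[:k])
    return edges


def decay_factor(w_decay, i, j, seq_len):
    """sigmoid(W_decay * |t_i - t_j|) with positions normalised to [0, 1]."""
    span = max(seq_len - 1, 1)
    dist = abs(i - j) / span
    return 1.0 / (1.0 + math.exp(-w_decay * dist))


def mesh_attention(x, edges, w_decay):
    """Single-head attention along mesh edges, decay applied after softmax."""
    seq_len, d = len(x), len(x[0])
    out = []
    for i, nbrs in enumerate(edges):
        scores = [sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d) for j in nbrs]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(exps)
        weights = [e / total for e in exps]
        # Multiplicative, post-softmax decay (assumption: no renormalisation).
        weights = [w * decay_factor(w_decay, i, j, seq_len) for w, j in zip(weights, nbrs)]
        out.append([sum(w * x[j][c] for w, j in zip(weights, nbrs)) for c in range(d)])
    return out


def dense_to_mesh_ratio(seq_len, k):
    """Nominal ratio of dense score computations (S*S) to mesh ones (S*k)."""
    return (seq_len * seq_len) / (seq_len * k)
```

Two observations about the sketch itself: a negative `w_decay` is what makes the factor shrink with distance (at distance zero the sigmoid is 0.5 regardless of sign), and the similarity search in `build_mesh` is still quadratic in sequence length, so the S/k saving computed by `dense_to_mesh_ratio` (128 at S=1024, k=8) applies to the attention step only. Calling `build_mesh` again on each layer's output would correspond to the per-layer graph rebuild described in the text.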
vigneshwar LK (Tue,) studied this question.
synapsesocial.com/papers/6a0ea196be05d6e3efb60653 — DOI: https://doi.org/10.5281/zenodo.20287390