We introduce K-Operators, a kernel-decomposed sequence modeling architecture that replaces attention entirely with structured causal kernel operators. On Tiny Shakespeare character-level modeling, a 1.14M-parameter K-Operators model achieves 4.43 ± 0.05 validation perplexity across 7 seeds, approaching the 4.35 PPL of a 10.65M-parameter Transformer baseline (nanoGPT) while using 9.3× fewer parameters and requiring no positional encodings. The architecture decomposes sequence mixing into a hierarchy of operators: K1 layers for position-wise feature mixing, K2 layers for causal sequence interaction via a learned base kernel combined with low-rank gamma-decayed recurrence, and a K0 layer for final rescaling. These are composed into a K-Stack backbone (K1 → K2 (×N) → K1 → K0) and refined through a learned iterative equilibrium loop governed by a scalar step-size η. Two interchangeable gamma-decay backends (mask and block) offer different memory/speed trade-offs. Diagnostic analysis reveals interpretable learned dynamics: the model progressively transfers sequence mixing from the initialized base kernel to the adaptive recurrent path, develops per-layer functional specialization, and learns to self-regulate the refinement loop, including robustness to a 10× learning-rate misspecification via automatic η suppression.
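The abstract gives no equations, so the following is only an illustrative sketch of how a causal base kernel plus a low-rank gamma-decayed path could be written, not the paper's implementation. Everything here is an assumption made for illustration: the function names (`decay_mask`, `k2_mask`, `k2_recurrent`, `refine`), the projections `Wa` and `Wb`, the decay form gamma^(t-s), and the damped update z + η(f(z) - z) for the refinement loop. The sketch shows that, under these assumptions, a masked (matrix) form and a step-by-step recurrent form compute the same causal output.

```python
import numpy as np

def decay_mask(T, gamma):
    # ASSUMED decay form: D[t, s] = gamma**(t - s) for s <= t, 0 otherwise.
    idx = np.arange(T)
    diff = idx[:, None] - idx[None, :]
    return np.where(diff >= 0, gamma ** np.maximum(diff, 0), 0.0)

def k2_mask(x, base, Wa, Wb, gamma):
    # Hypothetical "mask"-style mixing: build the full (T, T) operator.
    T = x.shape[0]
    a, b = x @ Wa, x @ Wb                      # low-rank factors, shape (T, r)
    causal_base = np.tril(base)                # base kernel restricted to the past
    low_rank = (a @ b.T) * decay_mask(T, gamma)
    return (causal_base + low_rank) @ x

def k2_recurrent(x, base, Wa, Wb, gamma):
    # Same computation with an explicit decayed state instead of a (T, T) mask.
    T, d = x.shape
    a, b = x @ Wa, x @ Wb
    out = np.tril(base) @ x
    S = np.zeros((Wa.shape[1], d))             # running state, shape (r, d)
    for t in range(T):
        S = gamma * S + np.outer(b[t], x[t])   # decay old state, add current token
        out[t] += a[t] @ S
    return out

def refine(x, step_fn, eta, n_iters=3):
    # ASSUMED refinement rule: damped fixed-point iteration with step size eta.
    z = x
    for _ in range(n_iters):
        z = z + eta * (step_fn(z) - z)
    return z
```

In this sketch the masked form materializes a T × T matrix while the recurrent form carries only an r × d state, which is one way a memory/speed trade-off between backends could arise. Setting `eta` to 0 leaves the input unchanged, so a small η corresponds to a suppressed refinement loop.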
Aileen Koneko (Fri,) studied this question.
www.synapsesocial.com/papers/69b6069b83145bc643d1ca2f — DOI: https://doi.org/10.5281/zenodo.19004568
Aileen Koneko