We present, to our knowledge, the first empirical comparison of Transformer attention and Mamba (a structured state space model) within the Joint-Embedding Predictive Architecture (JEPA). While Mamba has shown competitive results in classification and generation tasks, its applicability to JEPA’s masked latent prediction objective remains unexplored. We compare 7 architectures (Transformer, Vanilla Mamba, Bidirectional Mamba (BiMamba), and 4 Sequential Attention variants) across 5 datasets ranging from simple images (Moving MNIST) to complex videos (HMDB-51). Our key finding is that the degree of fine-grained temporal ambiguity in a task correlates with architecture suitability: on tasks with coarse temporal structure, the Transformer remains competitive or better (BiMamba/TF MSE ratio: ImageNet 1.10×; UCF-101 1.31× ± 0.10), while on tasks requiring fine-grained temporal discrimination (HMDB-51), BiMamba consistently achieves roughly half the MSE of the Transformer (0.55× ± 0.02, reproducible across 3 seeds). We also demonstrate why Sequential Attention approaches structurally fail for Mamba, and we confirm that modality-specific FFN separation remains beneficial even when all modalities share the same loss function. This is a toy-scale empirical study: we examine architectural trends rather than claim state-of-the-art capability, and the reported ratios should be interpreted as directional evidence, not production-ready benchmarks. All code, checkpoints, and results are publicly available.
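As a reading aid for ratios of the form "0.55× ± 0.02", the following is a minimal sketch of one way such a figure could be computed: divide the candidate model's MSE by the baseline's MSE for each seed, then report the mean and sample standard deviation of those per-seed ratios. Whether the study aggregates per-seed ratios or takes a ratio of mean MSEs is not stated in the abstract, so this aggregation is an assumption. The function name `mse_ratio_summary` and all input values are hypothetical placeholders, not results from the paper.

```python
from statistics import mean, stdev


def mse_ratio_summary(candidate_mse, baseline_mse):
    """Return (mean, sample std) of per-seed candidate/baseline MSE ratios.

    A mean below 1.0 means the candidate has lower MSE than the baseline.
    Assumption: ratios are formed per seed and then aggregated.
    """
    if len(candidate_mse) != len(baseline_mse):
        raise ValueError("need one MSE value per seed for both models")
    ratios = [c / b for c, b in zip(candidate_mse, baseline_mse)]
    # Sample standard deviation is undefined for a single seed.
    spread = stdev(ratios) if len(ratios) > 1 else 0.0
    return mean(ratios), spread


# Placeholder MSE values for three seeds; NOT taken from the paper.
candidate = [2.0, 3.0, 4.0]
baseline = [4.0, 4.0, 4.0]

ratio_mean, ratio_std = mse_ratio_summary(candidate, baseline)
print(f"candidate/baseline = {ratio_mean:.2f}x ± {ratio_std:.2f}")
```

Read this way, a ratio above 1 (such as the 1.10× and 1.31× figures) favors the Transformer, and a ratio below 1 (such as 0.55×) favors BiMamba.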
Brian Kim studied this question.
www.synapsesocial.com/papers/69cb6589e6a8c024954b98d6 — DOI: https://doi.org/10.5281/zenodo.19323214