We investigate how Transformer-based language models process inputs of varying lengths through systematic attention head ablation. Across five models from three architectural families (GPT-2, LLaMA, OPT), we identify a consistent transition in processing mode: inputs shorter than approximately 4 tokens are handled by attention heads that operate independently (redundancy score R ≈ 1–2), whereas longer inputs require coordinated multi-head integration (R > 36, up to 356×). We introduce the redundancy score, a simple diagnostic metric that quantifies the degree of distributed processing within a layer. Furthermore, all five models exhibit significant garden-path effects, that is, elevated surprisal at syntactic disambiguation points (p < 0.05 in all cases), and ablating Layer 0 in smaller models reduces this effect by 83%, suggesting that early layers specialize in initial syntactic commitment. These findings carry practical implications: (1) Layer 0 attention heads can be pruned for short-input tasks without performance loss, (2) syntactic processing is localized to early layers in smaller models but becomes distributed in larger ones, and (3) the redundancy score provides a model-agnostic tool for analyzing attention head coordination.
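The garden-path effect in the abstract is stated in terms of surprisal. As a minimal sketch, assuming the standard definition of surprisal as the negative log probability of a token given its context, and assuming the effect is taken as the surprisal difference at the disambiguating token between an ambiguous sentence and an unambiguous control, the quantity could be computed as follows. The function names and the probabilities are hypothetical illustrations, not values or code from the paper.

```python
import math


def surprisal_bits(prob: float) -> float:
    """Surprisal in bits: -log2 p(token | context)."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(prob)


def garden_path_effect(p_ambiguous: float, p_control: float) -> float:
    """Extra surprisal at the disambiguating token (ambiguous minus control).

    Assumption: both probabilities are the model's probability for the same
    disambiguating token, read from two different sentence contexts.
    """
    return surprisal_bits(p_ambiguous) - surprisal_bits(p_control)


# Hypothetical probabilities, chosen only to make the arithmetic easy to follow.
effect = garden_path_effect(p_ambiguous=0.125, p_control=0.5)
print(effect)  # 3 bits - 1 bit = 2.0
```

A positive value means the model found the disambiguating token less expected in the ambiguous sentence. Under the same assumptions, the ablation result in the abstract would correspond to recomputing this difference with the relevant heads disabled and comparing it against the unablated value.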
Yūki Ichikawa
Showa University
www.synapsesocial.com/papers/698c1c46267fb587c655e8d8 — DOI: https://doi.org/10.5281/zenodo.18538836