We investigate how Transformer-based language models process inputs of varying lengths through systematic attention head ablation. Across five models from three architectural families (GPT-2, LLaMA, OPT), we identify a consistent processing mode transition: inputs shorter than approximately 4 tokens are handled by independently operating attention heads (redundancy score R ≈ 1–2), whereas longer inputs require coordinated multi-head integration (R > 36, up to 356×). We introduce the redundancy score, a simple diagnostic metric that quantifies the degree of distributed processing within a layer. Furthermore, all five models exhibit significant garden-path effects—elevated surprisal at syntactic disambiguation points (p < 0.05 in all cases)—and ablating Layer 0 in smaller models reduces this effect by 83%, suggesting that early layers specialize in initial syntactic commitment. These findings carry practical implications: (1) Layer 0 attention heads can be pruned for short-input tasks without performance loss, (2) syntactic processing is localized to early layers in smaller models but becomes distributed in larger ones, and (3) the redundancy score provides a model-agnostic tool for analyzing attention head coordination.
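The garden-path effect described above is measured in surprisal. As a minimal sketch, assuming the standard definition (surprisal is the negative log probability a model assigns to a token given its context) and assuming the effect is taken as the surprisal difference at the disambiguating token between an ambiguous sentence and an unambiguous control, the computation looks like this. The function names and probability values are hypothetical illustrations, not figures or code from the paper.

```python
import math


def surprisal_bits(prob: float) -> float:
    """Surprisal in bits of a token the model assigned probability `prob`."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("prob must be in the interval (0, 1]")
    return -math.log2(prob)


def garden_path_effect(p_ambiguous: float, p_control: float) -> float:
    """Surprisal at the disambiguating token: ambiguous sentence minus control.

    Assumed contrast for illustration; a positive value means the token was
    more surprising in the garden-path sentence than in the control.
    """
    return surprisal_bits(p_ambiguous) - surprisal_bits(p_control)


# Hypothetical token probabilities, chosen only to illustrate the arithmetic.
effect = garden_path_effect(p_ambiguous=0.01, p_control=0.08)
print(f"garden-path effect: {effect:.2f} bits")
```

Under this assumed contrast, an ablation that "reduces the effect" would correspond to a smaller surprisal difference after the ablated heads are removed.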
Yūki Ichikawa (Mon,) studied this question.
www.synapsesocial.com/papers/698c1c46267fb587c655e8d8 — DOI: https://doi.org/10.5281/zenodo.18538836
Yūki Ichikawa
Showa University