What question did this study set out to answer?

This study asks how transformer models handle inputs of varying length and what role different layers play in syntactic processing.

February 11, 2026 · Open Access

Length-Dependent Processing Modes in Transformer Attention: Evidence from Multi-Architecture Ablation Studies

Key Points

Conducted attention head ablation studies on five transformer models from three architectural families (GPT-2, LLaMA, OPT).
Measured how attention heads operate independently versus collaboratively based on input length.
Introduced a redundancy score to quantify distributed processing across layers.
Identified a transition in processing modes at approximately 4 tokens in length for attention heads.
Showed that longer inputs require coordinated multi-head integration, with redundancy scores exceeding 36, compared with roughly 1–2 for short inputs.
Ablating Layer 0 in smaller models reduced garden-path effects by 83%, highlighting early layer specialization.
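
This page does not give the formula for the redundancy score. One reading consistent with the reported values (R near 1 when heads act independently, R far above 1 when they coordinate) is the ratio of the loss increase from ablating every head in a layer together to the sum of the loss increases from ablating each head alone. The sketch below implements that assumed definition only. The function name `redundancy_score` and all input numbers are hypothetical and are not taken from the paper.

```python
def redundancy_score(full_layer_delta, per_head_deltas):
    """Assumed definition (not confirmed by the source):
    joint ablation effect divided by the summed single-head effects.

    full_layer_delta: loss increase when all heads in the layer are ablated.
    per_head_deltas:  loss increases when each head is ablated on its own.
    """
    total_individual = sum(per_head_deltas)
    if total_individual <= 0:
        raise ValueError("summed single-head effects must be positive")
    return full_layer_delta / total_individual


# Hypothetical case 1: effects add up, so heads look independent.
independent = redundancy_score(1.0, [0.25, 0.25, 0.25, 0.25])

# Hypothetical case 2: single-head ablations barely matter, but removing
# the whole layer does, so the computation looks distributed across heads.
coordinated = redundancy_score(20.0, [0.125, 0.125, 0.125, 0.125])

print(independent, coordinated)  # 1.0 40.0
```

Under this assumed definition, a score near 1 means the layer's contribution is the sum of its heads' separate contributions, while a large score means other heads compensate when any single head is removed.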

Abstract

We investigate how Transformer-based language models process inputs of varying lengths through systematic attention head ablation. Across five models from three architectural families (GPT-2, LLaMA, OPT), we identify a consistent processing mode transition: inputs shorter than approximately 4 tokens are handled by independently operating attention heads (redundancy score R ≈ 1–2), whereas longer inputs require coordinated multi-head integration (R > 36, up to 356×). We introduce the redundancy score, a simple diagnostic metric that quantifies the degree of distributed processing within a layer. Furthermore, all five models exhibit significant garden-path effects—elevated surprisal at syntactic disambiguation points (p < 0.05 in all cases)—and ablating Layer 0 in smaller models reduces this effect by 83%, suggesting that early layers specialize in initial syntactic commitment. These findings carry practical implications: (1) Layer 0 attention heads can be pruned for short-input tasks without performance loss, (2) syntactic processing is localized to early layers in smaller models but becomes distributed in larger ones, and (3) the redundancy score provides a model-agnostic tool for analyzing attention head coordination.
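
The garden-path measure in the abstract rests on surprisal, the negative log probability a model assigns to a token given its context. The sketch below shows the arithmetic under two assumptions that the abstract does not confirm: that the effect is measured as the surprisal difference between an ambiguous sentence and an unambiguous control at the disambiguating token, and that the reduction from ablation is computed relative to the unablated effect. The function names and probabilities are hypothetical.

```python
import math


def surprisal_bits(prob):
    """Surprisal in bits: -log2 p(token | context)."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("prob must be in (0, 1]")
    return -math.log2(prob)


def garden_path_effect(p_ambiguous, p_control):
    """Assumed measure: extra surprisal at the disambiguating token in the
    ambiguous sentence relative to an unambiguous control sentence."""
    return surprisal_bits(p_ambiguous) - surprisal_bits(p_control)


def ablation_reduction(effect_full, effect_ablated):
    """Fraction of the garden-path effect removed by an ablation."""
    return 1.0 - effect_ablated / effect_full


# Hypothetical probabilities for the disambiguating token.
effect = garden_path_effect(p_ambiguous=1 / 64, p_control=1 / 4)
print(effect)                           # 4.0 bits
print(ablation_reduction(effect, 1.0))  # 0.75
```

With these made-up inputs, the ambiguous sentence costs 6 bits at the disambiguating token against 2 bits for the control, and an ablation that leaves a 1-bit effect removes 75% of it.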

Authors

Yūki Ichikawa

Institutions

Showa University
