What type of study is this?

This is a Quantitative Study study.

October 20, 2025Open Access

Strongly Connected Components Are All You Need: Graph-Theoretic Interpretability and Optimization for Vision Transformers

Key Points

Introducing a graph-theoretic framework reveals how attention patterns encode visual information, enhancing understanding.
Systematic evaluation across different architectures indicates that strongly connected components correspond to visual features.
Novel optimization strategy allows selective ablation of layers for improved inference speed with minimal accuracy loss.
ViT models demonstrate superior structural organization compared to DeiT and CLIP, providing a foundation for efficient model deployment.

Abstract

Abstract Vision Transformers have revolutionized computer vision with unprecedented performance, yet their attention mechanisms remain largely opaque, hindering both trust and optimization in critical applications 1, 2. Despite significant advances in transformer interpretability, existing approaches provide limited structural insights into how attention patterns encode visual information, creating a fundamental barrier to understanding these powerful models 3, 4. We address this challenge by introducing a novel, comprehensive graph-theoretic framework for analyzing Vision Transformer attention mechanisms. Our approach treats attention matrices as directed graphs and applies Tarjan’s Strongly Connected Components algorithm to reveal hidden structural patterns within attention flows 5. Through systematic evaluation of 22 pre-trained models across ViT, DeiT, and CLIP architectures, we discover that Strongly Connected Components with meaningful connectivity directly correspond to semantically coherent visual features—providing a mathematical foundation for understanding attention-based feature learning 6. This insight enables a novel optimization strategy: layers containing minimal meaningful connectivity can be selectively ablated through attention linearization, achieving 16-30% inference speedup with only 0-7% accuracy loss for large models. Our findings demonstrate that ViT models exhibit superior structural organization (92.6% effectiveness) compared to DeiT (51.1%) and CLIP (57.2%) variants, while patch32 configurations achieve optimal performance tradeoffs 7, 8. Beyond interpretability, this work establishes graph theory as a powerful lens for transformer analysis and provides practical tools for deploying efficient vision models without sacrificing semantic understanding 9.

Connected Papers

Building similarity graph...

Analyzing shared references across papers

Discussion

Authors

Devansh Garg

Actions

Institutions

Indian Institute of Technology Mandi

References and Citations

Connected Papers

Building similarity graph...

Analyzing shared references across papers

Strongly Connected Components Are All You Need: Graph-Theoretic Interpretability and Optimization for Vision Transformers

Key Points

Abstract

Citation Network

Connected Papers

Discussion

Authors

Actions

Institutions

References and Citations

Citation Network

Connected Papers

Discussion

Cite this study

Also consider