DiRAViG: Differentiable Region Assignment Vision Graph Networks

Key Points

DiRAViG-M achieves 81.5% Top-1 accuracy on ImageNet-1K with only 4.2 GMACs.
The proposed model replaces fixed patches with boundary-aligned region tokens produced by a differentiable, end-to-end assignment.
Propagating on a fixed, sparse one-hop spatial contact graph avoids repeated global k-NN graph construction and reduces computational requirements.
Together, these changes address two limitations of current Vision GNNs, graph-construction cost and patch misalignment with object boundaries, and may benefit efficient visual analysis tasks.

Abstract

Vision Graph Neural Networks (GNNs) offer a powerful alternative to CNNs and Transformers for modeling complex visual relationships. However, they still face two challenges: the high computational cost of repeated global k-NN graph construction, and the misalignment of rigid patch tokenization with object boundaries. We propose DiRAViG, which replaces fixed patches with boundary-aligned region tokens produced by a differentiable, end-to-end assignment, and propagates features on a fixed, sparse one-hop spatial contact graph using few-step diffusion. A bidirectional pixel–region pathway aggregates features into regions and projects them back to the image grid, preserving fine details and stabilizing training. On ImageNet-1K, DiRAViG-S achieves 78.7% Top-1 at 1.5 GMACs and DiRAViG-M reaches 81.5% at 4.2 GMACs. Compared to Pyramid ViG-S (∼4.6 GMACs) and ViHGNN-S (∼ GMACs), DiRAViG-M offers a better accuracy-efficiency trade-off. These results demonstrate that DiRAViG offers a scalable, boundary-aware solution for efficient vision analysis.
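
The pipeline described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify how the assignment is computed, so the softmax over feature distances, the 4-neighbor definition of region contact, the mean-of-neighbors diffusion step, and all function and parameter names (`soft_assign`, `contact_graph`, `diffuse`, `temperature`, `alpha`) are assumptions chosen for illustration.

```python
import numpy as np

def soft_assign(feats, centers, temperature=1.0):
    """Soft pixel-to-region assignment (assumed form: softmax over
    negative squared feature distances).
    feats: (N, C) pixel features, centers: (K, C) region centers.
    Returns an (N, K) matrix whose rows sum to 1."""
    d2 = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    logits = -d2 / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    return w / w.sum(axis=1, keepdims=True)

def pixels_to_regions(feats, assign):
    """Pixel -> region: assignment-weighted mean of pixel features, (K, C)."""
    mass = assign.sum(axis=0)[:, None]
    return (assign.T @ feats) / np.maximum(mass, 1e-8)

def regions_to_pixels(region_feats, assign):
    """Region -> pixel: project region features back to the grid, (N, C)."""
    return assign @ region_feats

def contact_graph(labels, num_regions):
    """One-hop spatial contact graph from hard labels of shape (H, W).
    Two regions are connected if they contain 4-neighboring pixels."""
    adj = np.zeros((num_regions, num_regions), dtype=bool)
    pairs = ((labels[:, :-1], labels[:, 1:]),   # horizontal neighbors
             (labels[:-1, :], labels[1:, :]))   # vertical neighbors
    for a, b in pairs:
        diff = a != b
        adj[a[diff], b[diff]] = True
        adj[b[diff], a[diff]] = True
    return adj

def diffuse(region_feats, adj, steps=2, alpha=0.5):
    """Few-step diffusion: blend each region with the mean of its
    one-hop neighbors; regions without neighbors are left unchanged."""
    a = adj.astype(region_feats.dtype)
    deg = a.sum(axis=1, keepdims=True)
    x = region_feats
    for _ in range(steps):
        nbr = np.where(deg > 0, (a @ x) / np.maximum(deg, 1.0), x)
        x = (1 - alpha) * x + alpha * nbr
    return x
```

In this sketch the contact graph is built once from hard labels (for example `assign.argmax(axis=1)` reshaped to the image grid) and reused across diffusion steps, which mirrors the abstract's contrast with repeated global k-NN graph construction. The assignment matrix itself stays soft, so aggregation and back-projection remain differentiable with respect to the pixel features and region centers.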
