Vision Graph Neural Networks (Vision GNNs) offer a powerful alternative to CNNs and Transformers for modeling complex visual relationships. However, they still face two challenges: the high computational cost of repeatedly constructing global k-NN graphs, and the misalignment of rigid patch tokenization with object boundaries. We propose DiRAViG, which replaces fixed patches with boundary-aligned region tokens produced by a differentiable, end-to-end assignment, and propagates features over a fixed, sparse, one-hop spatial contact graph using few-step diffusion. A bidirectional pixel–region pathway aggregates pixel features into regions and projects them back onto the image grid, preserving fine details and stabilizing training. On ImageNet-1K, DiRAViG-S achieves 78.7% Top-1 accuracy at 1.5 GMACs and DiRAViG-M reaches 81.5% at 4.2 GMACs. Compared to Pyramid ViG-S (∼4.6 GMACs) and ViHGNN-S (∼ GMACs), DiRAViG-M offers a better accuracy-efficiency trade-off. These results demonstrate that DiRAViG is a scalable, boundary-aware approach to efficient vision analysis.
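To make the pipeline concrete, the following is a minimal NumPy sketch of the three steps named above: soft pixel-to-region assignment, few-step diffusion on a fixed sparse region graph, and projection back to the pixel grid. It is an illustration under stated assumptions, not the DiRAViG implementation. All function names and parameters (`soft_assign`, `pool_to_regions`, `diffuse`, `project_to_pixels`, `temperature`, `alpha`, `steps`) are hypothetical, and the distance-based softmax assignment and the row-normalized averaging update are stand-ins chosen for simplicity.

```python
import numpy as np

def soft_assign(pixel_feats, region_centers, temperature=1.0):
    """Soft pixel-to-region assignment (assumed form: softmax over
    negative squared distances). Returns an (N, R) row-stochastic matrix."""
    diff = pixel_feats[:, None, :] - region_centers[None, :, :]
    logits = -(diff ** 2).sum(axis=-1) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

def pool_to_regions(pixel_feats, assign):
    """Pixel -> region: assignment-weighted mean of pixel features, (R, C)."""
    mass = assign.sum(axis=0)[:, None]
    return (assign.T @ pixel_feats) / np.maximum(mass, 1e-8)

def diffuse(region_feats, adjacency, steps=2, alpha=0.5):
    """A few diffusion steps on a fixed region graph (assumed update:
    blend each region with the mean of its one-hop neighbors)."""
    degree = adjacency.sum(axis=1, keepdims=True)
    transition = adjacency / np.maximum(degree, 1e-8)  # row-normalized
    z = region_feats
    for _ in range(steps):
        z = (1.0 - alpha) * z + alpha * (transition @ z)
    return z

def project_to_pixels(region_feats, assign):
    """Region -> pixel: each pixel receives a mix of its regions' features."""
    return assign @ region_feats

# Toy usage: 6 pixels, 2 channels, 3 regions connected in a chain.
rng = np.random.default_rng(0)
pixels = rng.normal(size=(6, 2))
centers = pixels[:3].copy()
contact_graph = np.array([[0, 1, 0],
                          [1, 0, 1],
                          [0, 1, 0]], dtype=float)

assign = soft_assign(pixels, centers)
regions = pool_to_regions(pixels, assign)
refined = project_to_pixels(diffuse(regions, contact_graph), assign)
```

Because the region graph is fixed and only connects spatially adjacent regions, each diffusion step in this sketch costs time proportional to the number of contact edges when a sparse representation is used, in contrast to rebuilding a global k-NN graph at every layer.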
Li et al. (Sat,) studied this question.