Vision Transformers (ViTs) achieve strong performance in natural and medical imaging, yet their decision processes remain opaque. This is especially problematic in high-stakes settings like chest X-ray interpretation. TransMM is among the strongest attribution methods for ViTs, combining attention with class-specific gradients to highlight influential image patches. We ask whether injecting semantic structure from Sparse Autoencoders (SAEs) can further improve the faithfulness of such attributions.

We introduce Feature-Gradient Attribution, which extends TransMM’s principle from attention space to feature space. SAEs are trained on residual streams to decompose activations into sparse, interpretable features, providing per-patch feature activations. We project gradients onto the SAE feature basis and compute feature-gradient scores that capture both which learned features are present and how they influence the target logit. These scores yield per-patch gates that modulate TransMM’s attention maps before relevance propagation, forming a lightweight, semantically informed correction.

Across three datasets (chest X-rays, endoscopy, natural images), two architectures (finetuned ViT-B/16 and contrastively pre-trained CLIP ViT-B/32), and three complementary faithfulness metrics, our method improves attribution faithfulness consistently. Improvements are statistically significant (p<0.001) on all three metrics for one dataset and on two of three metrics for the remaining datasets. We observe gains of 10.5-34.8% on SaCo and 9.7-43.0% on Faithfulness Correlation, with Pixel Flipping improving by 1.8-10.8%. Notably, we never observe degradation relative to TransMM on any metric–dataset combination.
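The gating idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the thesis implementation: the function names, the tanh squashing, the `strength` parameter, the choice to scale key columns and renormalize rows, and the omission of the CLS token are all assumptions made for illustration. The abstract only states that gradients are projected onto the SAE feature basis, combined with feature activations into per-patch scores, and turned into gates that modulate attention maps.

```python
import numpy as np


def feature_gradient_gates(acts, grads, decoder, strength=1.0):
    """Hypothetical per-patch gates from SAE activations and gradients.

    acts:    (P, F) SAE feature activations per patch
    grads:   (P, D) gradient of the target logit w.r.t. the residual stream
    decoder: (F, D) SAE decoder matrix; row f is the direction of feature f
    strength: assumed knob in [0, 1] controlling how far gates move from 1
    """
    # Project residual-stream gradients onto the feature basis: (P, F).
    feat_grads = grads @ decoder.T
    # Feature-gradient score: which features are present (acts) times how
    # they influence the logit (feat_grads), summed over features.
    scores = (acts * feat_grads).sum(axis=1)
    scale = np.abs(scores).max()
    if scale == 0:
        # No gradient signal: leave attention untouched.
        return np.ones_like(scores)
    # Assumed squashing: gates stay positive and are centred on 1.
    return 1.0 + strength * np.tanh(scores / scale)


def gate_attention(attn, gates):
    """Scale each key column of a (P, P) attention map and renormalize rows."""
    gated = attn * gates[None, :]
    return gated / gated.sum(axis=1, keepdims=True)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    acts = np.maximum(rng.normal(size=(4, 6)), 0.0)  # sparse, non-negative
    grads = rng.normal(size=(4, 3))
    decoder = rng.normal(size=(6, 3))
    gates = feature_gradient_gates(acts, grads, decoder)
    attn = np.full((4, 4), 0.25)
    print(gate_attention(attn, gates).sum(axis=1))
```

In this sketch a patch whose active features push the target logit up receives a gate above 1, and one whose features push it down receives a gate below 1; the gated map would then be passed to the usual relevance propagation in place of the raw attention.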
Julius Šula
TU Wien
Julius Šula (Sun,) studied this question.
www.synapsesocial.com/papers/69ba43f74e9516ffd37a5bca — DOI: https://doi.org/10.34726/hss.2026.132361