Monocular depth estimation is a central problem in computer vision with applications in robotics, augmented reality, and autonomous driving, yet the self-attention mechanisms used by modern Transformer architectures remain opaque. In this work, we integrate SVD-Inspired Attention (SVDA) into the Dense Prediction Transformer (DPT), introducing a spectrally structured attention formulation for dense prediction that decouples directional alignment from spectral modulation through a learnable diagonal matrix embedded in normalized query–key interactions. Experiments on KITTI and NYU-v2 show that SVDA preserves competitive predictive performance while enabling intrinsic interpretability: on KITTI, AbsRel improves from 0.058 to 0.056 and δ1 from 0.976 to 0.979, while on NYU-v2, AbsRel improves from 0.133 to 0.124 and δ1 from 0.865 to 0.872. This is achieved with only 0.01% additional parameters, at the cost of a measurable runtime overhead associated with the added normalization and spectral modulation. More importantly, SVDA enables six spectral indicators that quantify entropy, rank, sparsity, alignment, selectivity, and robustness, revealing consistent cross-dataset and depth-wise patterns in how attention organizes during training. These properties make the model easier to inspect and better suited to applications where transparency and reliability are important, such as robotics and autonomous navigation.
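The abstract describes the attention formulation only in words, so the following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes that queries and keys are L2-normalized per token (the "directional alignment" part) and that a learnable vector `sigma`, acting as the diagonal matrix, rescales each feature dimension before the softmax (the "spectral modulation" part). The function names, the exact placement of `sigma`, and the toy inputs are all hypothetical.

```python
import math


def l2_normalize(vec, eps=1e-8):
    """Scale a vector to (approximately) unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / (norm + eps) for x in vec]


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def svda_like_attention(queries, keys, values, sigma):
    """Attention with normalized query-key scores modulated by diag(sigma).

    Assumption (not confirmed by the source): the score between a query
    and a key is q_hat^T diag(sigma) k_hat, where q_hat and k_hat are the
    unit-length versions of the query and key.
    """
    q_hat = [l2_normalize(q) for q in queries]
    k_hat = [l2_normalize(k) for k in keys]

    weights = []
    for q in q_hat:
        # Direction comes from q_hat and k_hat; sigma reweights each dimension.
        scores = [sum(qi * s * ki for qi, s, ki in zip(q, sigma, k)) for k in k_hat]
        weights.append(softmax(scores))

    dim_v = len(values[0])
    outputs = [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(dim_v)]
        for row in weights
    ]
    return outputs, weights


# Toy inputs: two tokens, three feature dimensions (hypothetical values).
queries = [[1.0, 0.0, 2.0], [0.0, 3.0, 1.0]]
keys = [[2.0, 0.0, 4.0], [0.0, 1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
sigma = [1.0, 1.0, 1.0]  # identity modulation; would be learned in practice

outputs, weights = svda_like_attention(queries, keys, values, sigma)
```

Under this reading, rescaling a query or key changes nothing, because only its direction enters the score; all magnitude information is carried by `sigma`. That would also be consistent with the small parameter overhead the abstract reports, since a diagonal adds only one value per feature dimension.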
Vasileios Arampatzakis
George Pavlidis
Nikolaos Mitianoudis
Mathematics
Democritus University of Thrace
Athena Research and Innovation Center In Information Communication & Knowledge Technologies
Arampatzakis et al. studied this question.
www.synapsesocial.com/papers/69df2abce4eeef8a2a6afb1c — DOI: https://doi.org/10.3390/math14081272