Self-supervised endoscopic depth estimation reconstructs dense depth from clinical endoscopic video without requiring ground-truth depth annotations, which makes it well suited to widespread use in surgical environments. However, most current approaches treat each video frame as an independent static image and neglect the temporal correlations and dynamic scene changes inherent in endoscopic video. This limitation frequently produces depth artifacts such as flickering, geometric inconsistencies, and structural distortions. In this study, we introduce a Siamese self-supervised framework that jointly leverages stereo and temporal information to improve depth estimation in dynamic endoscopic videos. Central to our approach are two spatiotemporal feature aggregation modules. The Deformable Spatiotemporal Cross-View Fusion (DSCF) module explicitly constructs stereo and temporal cost volumes and fuses them at the cost-volume level. The Multi-Scale Selective Spatiotemporal Aggregation (MSSA) module captures long-term state memory through cross-resolution and cross-frame hidden-state propagation, and refines features by selectively integrating multi-scale inputs and multi-receptive-field spatiotemporal representations. To enable efficient deployment on endoscopic edge devices, we propose a three-stage training protocol that incrementally incorporates temporal supervision and finally distills the model into a monocular depth estimation network. Comprehensive evaluations on four publicly available endoscopic datasets (SCARED, SERV-CT, EndoNeRF, and Hamlyn) demonstrate that our method attains state-of-the-art performance in both depth accuracy and image reconstruction quality, reducing root mean square error (RMSE) by more than 9.1% on average relative to the strongest existing method.
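For readers unfamiliar with the headline metric, the following is a minimal sketch of how a depth RMSE and a relative RMSE reduction could be computed. It is an illustration only, not the evaluation code used in this work: the function names `rmse` and `relative_reduction`, the convention that ground-truth pixels with depth 0 are invalid, and the absence of any depth capping or scale alignment are all assumptions made for the example.

```python
import numpy as np


def rmse(pred, gt, mask=None):
    """Root mean square error between predicted and ground-truth depth.

    Assumption: if no mask is given, pixels with gt <= 0 are treated as
    invalid (no ground truth available) and excluded.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    if mask is None:
        mask = gt > 0
    diff = pred[mask] - gt[mask]
    return float(np.sqrt(np.mean(diff ** 2)))


def relative_reduction(baseline_rmse, method_rmse):
    """Percentage by which method_rmse is lower than baseline_rmse."""
    return (baseline_rmse - method_rmse) / baseline_rmse * 100.0


if __name__ == "__main__":
    # Toy values chosen for illustration; not data from any dataset.
    gt = np.array([[1.0, 2.0], [3.0, 6.0]])
    pred = np.array([[1.0, 2.0], [3.0, 4.0]])
    print("RMSE:", rmse(pred, gt))
    print("Reduction (%):", relative_reduction(10.0, 9.0))
```

Under this definition, an average reduction of more than 9.1% means that the per-dataset values of `relative_reduction`, averaged over the datasets, exceed 9.1; whether the averaging is done this way in the reported results is itself an assumption of the sketch.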
Wang et al. (Thu,) studied this question.