What question did this study set out to answer?

The central aim is to improve depth estimation from endoscopic video sequences without requiring ground-truth depth labels.

February 2, 2026

Leveraging Spatiotemporal Cues for Self-Supervised Stereo Depth Estimation in Endoscopic Videos

Key Points

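Developed a Siamese self-supervised framework that jointly uses stereo and temporal information for depth estimation in dynamic endoscopic videos.
Introduced two spatiotemporal feature aggregation modules: Deformable Spatiotemporal Cross-View Fusion (DSCF) and Multi-Scale Selective Spatiotemporal Aggregation (MSSA).
Proposed a three-stage training protocol that incrementally adds temporal supervision and finally distills the model into a monocular depth network for deployment on endoscopic edge devices.
Achieved state-of-the-art depth accuracy and image reconstruction quality on four public endoscopic datasets (SCARED, SERV-CT, EndoNeRF, and Hamlyn).
Reduced root mean square error (RMSE) by more than 9.1% on average relative to the strongest existing method.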
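
The RMSE figure in the key points is a relative reduction against a baseline. As a minimal sketch of how such a number is typically computed, the Python below evaluates RMSE over pixels with valid ground-truth depth and then takes the fractional improvement. The function names, the "depth greater than zero means valid" masking rule, and all sample values are assumptions for illustration; they are not taken from the study or its evaluation code.

```python
import math


def rmse(pred, gt):
    """Root mean square error over pixels whose ground-truth depth is valid.

    Assumption: a ground-truth value <= 0 marks a pixel with no measurement.
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth pixels")
    return math.sqrt(sum((p - g) ** 2 for p, g in pairs) / len(pairs))


def relative_reduction(baseline_error, new_error):
    """Fractional reduction of an error metric relative to a baseline."""
    return (baseline_error - new_error) / baseline_error


# Illustrative depths in millimetres; these are NOT results from the study.
gt = [50.0, 60.0, 0.0, 80.0]             # third pixel has no ground truth
baseline_pred = [54.0, 57.0, 10.0, 80.0]
new_pred = [52.0, 58.5, 10.0, 80.0]

baseline_rmse = rmse(baseline_pred, gt)
new_rmse = rmse(new_pred, gt)
print(f"relative RMSE reduction: {relative_reduction(baseline_rmse, new_rmse):.1%}")
```

An average reduction across several datasets, as reported here, would repeat this per dataset and then average the per-dataset reductions or errors; the summary does not say which averaging convention was used.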

Abstract

Self-supervised endoscopic depth estimation seeks to reconstruct dense depth information from clinical endoscopic video sequences without the necessity of ground-truth depth annotations, thereby offering significant potential for widespread application in surgical environments. Nevertheless, the majority of current approaches treat each video frame as an independent static image, neglecting the inherent temporal correlations and dynamic scene changes present in endoscopic videos. This limitation frequently results in depth estimation artifacts such as flickering, geometric inconsistencies, and structural distortions. In this study, we introduce a Siamese self-supervised framework that simultaneously leverages stereo and temporal information to enhance depth estimation in dynamic endoscopic videos. Central to our approach are two spatiotemporal feature aggregation modules. The Deformable Spatiotemporal Cross-View Fusion (DSCF) module explicitly constructs stereo and temporal cost volumes and performs integrated fusion at the cost-volume level. The Multi-Scale Selective Spatiotemporal Aggregation (MSSA) module captures long-term state memory through cross-resolution and cross-frame hidden-state propagation, while refining features by selectively integrating multi-scale inputs and multi-receptive-field spatiotemporal representations. To facilitate efficient deployment on endoscopic edge devices, we propose a three-stage training protocol that incrementally incorporates temporal supervision and ultimately distills the model into a monocular depth estimation network. Comprehensive evaluations conducted on four publicly available endoscopic datasets (SCARED, SERV-CT, EndoNeRF, and Hamlyn) demonstrate that our method attains state-of-the-art performance in both depth accuracy and image reconstruction quality, achieving an average reduction in root mean square error (RMSE) exceeding 9.1% relative to the strongest existing method.
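
To make the term "cost volume" concrete, here is a deliberately simple Python sketch of a classical stereo cost volume on a single image row. It is not the DSCF module: the study builds stereo and temporal cost volumes from learned features and fuses them, whereas this sketch uses raw intensities, an absolute-difference cost, and a winner-take-all readout. The function names and sample values are assumptions chosen for illustration.

```python
def scanline_cost_volume(left, right, max_disp):
    """Build cost[d][x] = |left[x] - right[x - d]| for one image row.

    Entries where x - d falls outside the row are set to None.
    """
    width = len(left)
    volume = []
    for d in range(max_disp + 1):
        row = [
            abs(left[x] - right[x - d]) if x - d >= 0 else None
            for x in range(width)
        ]
        volume.append(row)
    return volume


def winner_take_all(volume):
    """Pick, for each pixel, the disparity with the lowest valid cost."""
    width = len(volume[0])
    disparities = []
    for x in range(width):
        candidates = [
            (volume[d][x], d)
            for d in range(len(volume))
            if volume[d][x] is not None
        ]
        disparities.append(min(candidates)[1])
    return disparities


# Toy row: the left view is the right view shifted by two pixels.
right_row = [1, 5, 9, 3, 7, 2]
left_row = [0, 0, 1, 5, 9, 3]

volume = scanline_cost_volume(left_row, right_row, max_disp=2)
print(winner_take_all(volume))
```

Matching the current frame against a neighbouring frame from the same camera, instead of against the other stereo view, would give a temporal analogue of the same structure, although camera and tissue motion between frames is not restricted to a horizontal shift as in this sketch.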
