What question did this study set out to answer?

The aim is to enhance the generalization of video object segmentation in challenging scenarios involving complex objects and varying conditions.

April 10, 2026 · Open Access

Domain Generalization for Multiple Video Object Segmentation and Tracking Using Transformers and Smart Memory

Key Points

The aim is to enhance the generalization of video object segmentation in challenging scenarios involving complex objects and varying conditions.
Developed MuSMem architecture combining multiple innovations for VOS and tracking.
Incorporated smart memory to manage key frames based on relevance and freshness.
Utilized monocular depth maps to improve robustness against occlusions.
Achieved first place on VOTSt-2024 and Long Video Dataset benchmarks.
Demonstrated significant reduction in tracking drift and improved segmentation accuracy.
Enhanced long-term prediction performance and memory efficiency.

Abstract

Video Object Segmentation (VOS) is a key component in computer vision applications, including surveillance, autonomous driving, and robotics. However, existing VOS models often struggle to generalize to new videos with complex, topologically transforming deformable objects (e.g., cooking, assembling, state change), degraded environments, and long video sequences, resulting in tracking drift, low recall, and memory saturation. We developed the Multiple object VOS and tracking Smart Memory architecture (MuSMem), a generalizable approach that incorporates three key innovations: (i) fusing SAM with High-Quality masks alongside appearance-based candidate selection to refine coarse segmentation masks, resulting in improved object boundaries; (ii) a dynamic smart memory that manages a history of key frames based on a novel information-preserving gain, combined with relevance and freshness spatio-temporal criteria; and (iii) the exploratory use of monocular depth maps for occlusion robustness. MuSMem significantly reduces memory usage, reduces drift, tracks complex object topological changes, and improves long-term prediction performance. MuSMem can be integrated with Vision-Language Models (VLMs) for zero-shot generalization to unseen visual domains. Experiments using VOS benchmark datasets show that MuSMem ranks first on VOTSt-2024, Long Video Dataset, and LVOS, and second on VOTS-2024, demonstrating the best generalizability and state-of-the-art performance across single-, multi-, and complex VOS tasks.
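
To make the smart memory idea in (ii) more concrete, the following is a minimal sketch of a fixed-capacity key-frame store that evicts by a combined relevance and freshness score. It is not the MuSMem implementation: the abstract does not define the information-preserving gain, so the sketch leaves it out, and the names `KeyFrame`, `SmartMemory`, `alpha`, and `tau`, the cosine-similarity relevance, and the exponential freshness decay are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass, field


@dataclass
class KeyFrame:
    index: int            # frame number in the video
    feature: list         # hypothetical per-frame embedding (list of floats)


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class SmartMemory:
    capacity: int = 3
    alpha: float = 0.5    # assumed weight on relevance versus freshness
    tau: float = 50.0     # assumed freshness decay constant, in frames
    frames: list = field(default_factory=list)

    def score(self, kf, query, now):
        # Relevance: similarity of the stored frame to the current frame.
        relevance = cosine(kf.feature, query)
        # Freshness: decays exponentially with the age of the stored frame.
        freshness = math.exp(-(now - kf.index) / self.tau)
        return self.alpha * relevance + (1 - self.alpha) * freshness

    def add(self, kf):
        """Insert a key frame; if over capacity, evict the lowest-scoring one."""
        self.frames.append(kf)
        if len(self.frames) > self.capacity:
            worst = min(
                self.frames,
                key=lambda f: self.score(f, kf.feature, kf.index),
            )
            self.frames.remove(worst)


memory = SmartMemory(capacity=2)
memory.add(KeyFrame(0, [1.0, 0.0]))
memory.add(KeyFrame(10, [0.0, 1.0]))
memory.add(KeyFrame(20, [1.0, 0.1]))   # triggers eviction of one stored frame
kept = [f.index for f in memory.frames]
print(kept)
```

In this toy run, frame 10 is dropped: it is more recent than frame 0 but far less similar to the current frame, so its combined score is the lowest. Bounding the store this way is one simple route to the reduced memory usage on long sequences that the abstract describes.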

Authors

Elham Soltani Kazemi

Imad Eddine Toubal

Gani Rahmon

Journals

International Journal of Computer Vision

Institutions

University of Missouri
