Segmenting individual timber logs in robotic grasping scenarios poses significant challenges due to cluttered arrangements, overlapping geometries, and visually uniform textures, requiring instance segmentation models that balance accuracy and computational efficiency. In this work, we study the integration of the EfficientViT-SAM backbone into the MP-Former framework to analyze its impact on segmentation accuracy, inference speed, and cross-dataset generalization in autonomous forestry applications. Our contributions are threefold: (1) we benchmark Mask2Former and MP-Former with different variants of Swin Transformer as backbones on the TimberSeg 1.0 dataset, (2) we study the use of the EfficientViT-SAM-XL architecture as an alternative encoder backbone to analyze its impact on inference speed and segmentation accuracy, and (3) we use an In-house dataset as a hold-out test set, comprising 113 images and 923 annotations in the annotated subset and 50 images in the unannotated subset, for evaluating model generalization under real-world deployment scenarios. On the TimberSeg 1.0 dataset, our top-performing model, EfficientViT-SAM-XL1 MP-Former, achieves an mAP of 61.05, outperforming the Swin-B Mask2Former of the TimberSeg 1.0 paper by +3.52 mAP, while running at 12 FPS (+3.53 FPS gain). When tested on our In-house dataset, the model attains an mAP of 67.06. Notably, it matches the memory efficiency of TimberSeg’s strongest baseline, despite having nearly double the number of parameters, demonstrating its practical viability for robotic applications in forestry environments.
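The gains quoted above imply baseline figures that the abstract does not state directly. As a rough arithmetic check, the short sketch below derives them from the reported numbers. The variable names are illustrative, the derived values are inferences rather than figures quoted from the paper, and the 12 FPS value may be rounded, so the implied baseline speed is approximate.

```python
# Figures quoted in the abstract.
ours_map = 61.05   # mAP of EfficientViT-SAM-XL1 MP-Former on TimberSeg 1.0
map_gain = 3.52    # reported mAP gain over Swin-B Mask2Former
ours_fps = 12.0    # reported inference speed (possibly rounded)
fps_gain = 3.53    # reported FPS gain over the same baseline

# Implied baseline values (derived here, not quoted from the paper).
baseline_map = round(ours_map - map_gain, 2)
baseline_fps = round(ours_fps - fps_gain, 2)

# Annotated in-house subset: 923 annotations across 113 images.
annotations_per_image = round(923 / 113, 2)

print(f"implied baseline mAP: {baseline_map}")
print(f"implied baseline FPS: {baseline_fps}")
print(f"mean annotations per image: {annotations_per_image}")
```

Under these assumptions, the annotated in-house subset averages roughly eight annotated logs per image, which is consistent with the cluttered scenes the abstract describes.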
Sayan Mandal
Stefan Ainetter
Friedrich Fraundorfer
Mandal et al. studied this question.
www.synapsesocial.com/papers/699405bb4e9c9e835dfd68e5 — DOI: https://doi.org/10.3390/robotics15020044