February 17, 2026 | Open Access

Instance Segmentation in Autonomous Log Grasping Using EfficientViT-SAM MP-Former

Key Points

Aims to improve instance segmentation for robotic log grasping in forestry settings.
Integrates the EfficientViT-SAM backbone into the MP-Former framework.
Benchmarks Mask2Former and MP-Former with Swin Transformer variants on the TimberSeg 1.0 dataset.
Uses an In-house dataset as a hold-out test set to evaluate generalization to real-world scenarios.
EfficientViT-SAM-XL1 MP-Former achieves an mAP of 61.05 on TimberSeg 1.0, outperforming the Swin-B Mask2Former of the TimberSeg 1.0 paper by 3.52 mAP.
The model runs at 12 FPS, a gain of 3.53 FPS.
On the In-house dataset, the model attains an mAP of 67.06 and matches the memory efficiency of TimberSeg's strongest baseline despite having nearly double the number of parameters.

Abstract

Segmenting individual timber logs in robotic grasping scenarios poses significant challenges due to cluttered arrangements, overlapping geometries, and visually uniform textures, requiring instance segmentation models that balance accuracy and computational efficiency. In this work, we study the integration of the EfficientViT-SAM backbone into the MP-Former framework to analyze its impact on segmentation accuracy, inference speed, and cross-dataset generalization in autonomous forestry applications. Our contributions are threefold: (1) we benchmark Mask2Former and MP-Former with different variants of Swin Transformer as backbones on the TimberSeg 1.0 dataset, (2) we study the use of the EfficientViT-SAM-XL architecture as an alternative encoder backbone to analyze its impact on inference speed and segmentation accuracy, and (3) we use an In-house dataset as a hold-out test set, comprising 113 images and 923 annotations in the annotated subset and 50 images in the unannotated subset, for evaluating model generalization under real-world deployment scenarios. On the TimberSeg 1.0 dataset, our top-performing model, EfficientViT-SAM-XL1 MP-Former, achieves an mAP of 61.05, outperforming the Swin-B Mask2Former of the TimberSeg 1.0 paper by +3.52 mAP, while running at 12 FPS (+3.53 FPS gain). When tested on our In-house dataset, the model attains an mAP of 67.06. Notably, it matches the memory efficiency of TimberSeg’s strongest baseline, despite having nearly double the number of parameters, demonstrating its practical viability for robotic applications in forestry environments.
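The reported gains can be turned back into the baseline figures they imply. The short sketch below does that arithmetic using only the numbers quoted in the abstract. It assumes that the FPS gain is measured against the same Swin-B Mask2Former baseline as the mAP gain, which the abstract suggests but does not state outright, and that "12 FPS" is not heavily rounded. The function names are illustrative and do not come from the paper.

```python
def implied_baseline(reported: float, gain: float) -> float:
    """Baseline value implied by a reported score and its stated absolute gain."""
    return round(reported - gain, 2)


def relative_gain_pct(reported: float, gain: float) -> float:
    """Gain expressed as a percentage of the implied baseline."""
    baseline = reported - gain
    return round(100.0 * gain / baseline, 1)


# Figures quoted in the abstract (TimberSeg 1.0).
MAP_REPORTED, MAP_GAIN = 61.05, 3.52
FPS_REPORTED, FPS_GAIN = 12.0, 3.53  # assumption: same baseline as the mAP gain

baseline_map = implied_baseline(MAP_REPORTED, MAP_GAIN)
baseline_fps = implied_baseline(FPS_REPORTED, FPS_GAIN)

print(f"implied baseline mAP: {baseline_map}")
print(f"implied baseline FPS: {baseline_fps}")
print(f"relative mAP gain: {relative_gain_pct(MAP_REPORTED, MAP_GAIN)}%")
print(f"relative FPS gain: {relative_gain_pct(FPS_REPORTED, FPS_GAIN)}%")
```

Under these assumptions the baseline sits at roughly 57.53 mAP and 8.47 FPS, so the speed improvement is proportionally much larger than the accuracy improvement.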

Authors

Sayan Mandal

Stefan Ainetter

Friedrich Fraundorfer
