To address the challenge of pose estimation in intelligent long-staple cotton harvesting caused by dense plant distribution and pronounced spatial heterogeneity of bolls, this study introduces a pose estimation method that integrates growth-relation keypoint constraints with vision–language model reasoning. A YOLO for cotton and stem segmentation (YOLO-CSS) model is developed to achieve fine-grained boll–stem segmentation under complex field conditions. Mathematical representations of the fractal characteristics and structural complexity of cotton plants are established, and growth structure modeling is used to analyze boll–stem spatial relationships, providing structural priors for subsequent orientation estimation. A cotton boll orientation reasoning method based on vision–language model understanding (VLM-OR) is then developed, in which the model evaluates the reliability of stem-growth keypoint extraction, establishes growth direction priors from these keypoints, and incorporates the boll attachment direction for rule-based reasoning, enabling orientation estimation under weak-texture conditions. Furthermore, a boll–stem cooperative localization method is formulated through spatial geometric reasoning, using stem-growth keypoints as spatial anchors to derive the 3D picking pose of bolls and compensate for depth-direction positioning errors of the end effector, thus supporting dynamic alignment between perceptual outputs and execution parameters. Experimental results show that the proposed VLM-OR achieves an orientation success rate of 94.1 % in complex scenarios. Additionally, 74 % of depth keypoint errors remain below 3 mm, the orientation-based grasping success rate reaches 80 %, and more than 65 % of picking attempts succeed within a ±20° tolerance range. These findings confirm the method’s accuracy and operational adaptability, offering strong methodological support for visual perception in long-staple cotton picking robots.
• A pose estimation method for long-staple cotton picking using growth-relation keypoints and vision–language model reasoning.
• A mathematical representation of the fractal characteristics and structural complexity of cotton plants.
• A cotton boll orientation reasoning method based on vision–language model understanding.
• A cotton boll–stem coordinated positioning method that compensates for depth positioning errors in actuators.
• 74 % of depth localization errors fall within 3 mm, and the pose-grasping success rate reaches 80 %.
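The abstract does not give the formulas behind the growth direction prior or the ±20° tolerance check, so the following is only a minimal sketch of the kind of geometry involved, not the authors' implementation. It assumes the stem-growth keypoints are ordered 3D points from base to tip and takes the growth direction as the unit vector between the first and last keypoint. The function names (`growth_direction`, `angle_deg`, `within_tolerance`) and the sample coordinates are hypothetical.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def normalize(v: Vec3) -> Vec3:
    """Return v scaled to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    if n == 0.0:
        raise ValueError("zero-length vector has no direction")
    return (v[0] / n, v[1] / n, v[2] / n)


def growth_direction(keypoints: Sequence[Vec3]) -> Vec3:
    """Assumed prior: unit vector from the first (base) to the last (tip) keypoint."""
    if len(keypoints) < 2:
        raise ValueError("need at least two keypoints")
    base, tip = keypoints[0], keypoints[-1]
    return normalize((tip[0] - base[0], tip[1] - base[1], tip[2] - base[2]))


def angle_deg(a: Vec3, b: Vec3) -> float:
    """Angle between two direction vectors, in degrees."""
    ua, ub = normalize(a), normalize(b)
    # Clamp the dot product so rounding error cannot push acos out of its domain.
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(ua, ub))))
    return math.degrees(math.acos(dot))


def within_tolerance(estimated: Vec3, reference: Vec3, tol_deg: float = 20.0) -> bool:
    """True if the estimated orientation deviates from the reference by at most tol_deg."""
    return angle_deg(estimated, reference) <= tol_deg


# Hypothetical stem keypoints in metres, ordered from base to tip.
stem = [(0.0, 0.0, 0.0), (0.0, 0.01, 0.02), (0.0, 0.02, 0.04)]
prior = growth_direction(stem)
print(within_tolerance(prior, (0.0, 0.0, 1.0)))
```

A chord between the end keypoints is the simplest possible prior; a line fit over all keypoints would be less sensitive to a single mislocated point, at the cost of extra code.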
Zhi Liang
Xiaojuan Li
Zhonglong Lin
Industrial Crops and Products
Chinese Academy of Agricultural Sciences
Xinjiang University
Xinjiang Production and Construction Corps
Liang et al. (Fri,) studied this question.
synapsesocial.com/papers/69a528ecf1e85e5c73bf05bc — DOI: https://doi.org/10.1016/j.indcrop.2026.122967