Computer vision offers significant potential for the continuous, stress-free, and cost-effective monitoring of animal behavior, yet its application in goat farming remains limited. In this study, a Multi-Object Detection (MOD) model was developed to classify goat behavior into four categories: eating, standing, drinking, and lying. Zenithal videos were recorded under both light and dark conditions on an experimental goat farm, yielding 10,740 labelled annotations that were used to train, validate, and test 13 models built on Transformer- and CNN-based pretrained architectures. YOLO-based models (YOLOv8 and YOLOX) achieved the highest overall performance in both their large and lightweight versions, demonstrating high detection capability and potential suitability for hardware-constrained scenarios. The YOLOX-based MOD model is preferred for goat behavior detection because of its superior classification accuracy, fast inference speed, and fully open-source license, which enables flexible customization, deployment, and reproducibility. Other models, particularly DAB-DETR and H-DINO, underperformed, especially in detecting drinking behavior. Drinking is the most challenging class because of its visual similarity to standing, class imbalance, and fisheye distortion in the frame regions where the drinkers are located. Mitigation strategies, including focal loss and distortion correction, improved detection accuracy for this class and reduced performance variability. The developed MOD model can be deployed for continuous group-level monitoring of goats, paving the way for scalable and efficient solutions for advanced behavioral analyses. Future work will focus on integrating tracking algorithms for animal-level insights and on evaluating model generalizability across different farming conditions and goat breeds.

• Computer vision applications for goat farming are still very limited in the literature.
• Various models for goat behavior detection were developed and tested.
• YOLOX achieved a mAP@0.50 of 0.957 with strong performance across the classes.
• Drinking detection is hard due to visual similarity, class imbalance, and distortion.
• Frame undistortion improves drinking detection performance and reduces variability.
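To illustrate why focal loss can help with an under-represented class such as drinking, the following is a minimal sketch of the standard binary focal loss, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t). It is not the implementation used in the study: the function name `binary_focal_loss` and the default values `gamma=2.0` and `alpha=0.25` are assumptions chosen for illustration only.

```python
import math


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss for a single prediction (illustrative sketch).

    p     : predicted probability of the positive class, in (0, 1)
    y     : ground-truth label, 0 or 1
    gamma : focusing parameter (assumed default, not taken from the study)
    alpha : weight of the positive class (assumed default, not taken from the study)
    """
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)          # clamp to avoid log(0)
    p_t = p if y == 1 else 1.0 - p            # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t) ** gamma shrinks the loss of well-classified examples
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# A confident correct prediction contributes far less than a confident wrong one.
easy = binary_focal_loss(0.9, 1)
hard = binary_focal_loss(0.1, 1)
print(easy < hard)
```

With `gamma=0` the expression reduces to an alpha-weighted cross-entropy; larger `gamma` values shift more of the training signal toward hard, misclassified examples, which is the property that makes the loss attractive for rare classes.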
Méndez et al. (Sun,) studied this question.