Recent advancements in vision language models (VLMs) have opened new avenues for analyzing complex visual data. Models such as ChatGPT, Gemini, Llama and LLaVA have gained prominence for their ability to process both visual and textual data, excelling in tasks like natural image captioning, visual question answering (VQA), and reasoning. Similarly, the Segment Anything Model (SAM) by Meta has demonstrated remarkable segmentation capabilities. Given the importance of microscopy images in fields like biology, medicine, and materials science—where visual data is often analyzed alongside textual information from captions, reports, or literature—it is critical to evaluate the effectiveness of these models on such data. This study assesses the capabilities of ChatGPT-5, Gemini-2.5Pro, Llama-3.2V, LLaVA-1.5 and SAM-2 on classification, segmentation, counting, and VQA tasks using microscopy images. ChatGPT and Gemini excelled in comprehending microscopy images, while SAM performed well in object isolation. Although their performance falls short of domain expert accuracy, particularly when faced with complexities such as impurities, overlaps, and irrelevant artifacts, these models show clear gains compared to prior versions. These findings highlight the promise of VLMs in scientific image analysis and the need for further advancements to meet the demands of expert-level tasks.
Verma et al. studied this question.
www.synapsesocial.com/papers/69eb08ef553a5433e34b3980 — DOI: https://doi.org/10.1038/s41524-026-02069-y
Prateek Verma
Minh–Hao Van
Xintao Wu
npj Computational Materials
University of Arkansas at Fayetteville