What question did this study set out to answer?

The aim is to evaluate the effectiveness of vision language models in analyzing microscopy images for various tasks.

April 24, 2026 · Open Access

Vision language models for scientific image analysis: an evaluation highlighting opportunities and challenges

Key Points

Assessment of models including ChatGPT-5, Gemini-2.5Pro, Llama-3.2V, LLaVA-1.5, and SAM-2.
Tasks include classification, segmentation, counting, and visual question answering using microscopy images.
Comparison of performance against prior model versions and domain expert accuracy.
ChatGPT and Gemini excelled in comprehending microscopy images, achieving higher scores.
SAM demonstrated strong performance in object isolation and segmentation tasks.
Despite improvements, performance did not reach domain expert accuracy, especially with complex images.

Abstract

Recent advancements in vision language models (VLMs) have opened new avenues for analyzing complex visual data. Models such as ChatGPT, Gemini, Llama and LLaVA have gained prominence for their ability to process both visual and textual data, excelling in tasks like natural image captioning, visual question answering (VQA), and reasoning. Similarly, the Segment Anything Model (SAM) by Meta has demonstrated remarkable segmentation capabilities. Given the importance of microscopy images in fields like biology, medicine, and materials science—where visual data is often analyzed alongside textual information from captions, reports, or literature—it is critical to evaluate the effectiveness of these models on such data. This study assesses the capabilities of ChatGPT-5, Gemini-2.5Pro, Llama-3.2V, LLaVA-1.5 and SAM-2 on classification, segmentation, counting, and VQA tasks using microscopy images. ChatGPT and Gemini excelled in comprehending microscopy images, while SAM performed well in object isolation. Although their performance falls short of domain expert accuracy, particularly when faced with complexities such as impurities, overlaps, and irrelevant artifacts, these models show clear gains compared to prior versions. These findings highlight the promise of VLMs in scientific image analysis and the need for further advancements to meet the demands of expert-level tasks.

Cite this study

Verma et al. studied this question.

www.synapsesocial.com/papers/69eb08ef553a5433e34b3980 — DOI: https://doi.org/10.1038/s41524-026-02069-y

Also consider

Synapse has enriched 5 closely related papers on similar clinical questions. Consider them for comparative context:

Annotated high-throughput microscopy image sets for validation · 2012 · 619 citations
Adaptive characterization of microstructure dataset using a two stage machine learning approach · 2020 · 66 citations
Machine learning predictions on fracture toughness of multiscale bio-nano-composites · 2020 · 72 citations
A liquid crystal-based biomaterial platform for rapid sensing of heat stress using machine learning · 2024 · 3 citations

Authors

Prateek Verma

Minh-Hao Van

Xintao Wu

Journals

npj Computational Materials

Institutions

University of Arkansas at Fayetteville
