While multimodal LLMs (MLLMs) demonstrate remarkable progress in reasoning, their application to specialized scientific domains such as physics reveals significant gaps in current evaluation benchmarks. Specifically, existing benchmarks often lack fine-grained subject coverage, neglect the step-by-step reasoning process, and are predominantly English-centric; they also fail to systematically evaluate the role of visual information. We therefore introduce Multi-Physics, a comprehensive benchmark for Chinese physics reasoning that covers 5 difficulty levels and comprises 1,412 image-associated multiple-choice questions spanning 11 high-school physics subjects. We employ a dual evaluation framework to assess 20 different MLLMs, analyzing both final-answer accuracy and the step-by-step integrity of their chain-of-thought. Furthermore, we systematically study the impact of difficulty level and visual information by comparing model performance before and after changing the input mode. Our work provides both a fine-grained resource for the community and a robust methodology for dissecting the multimodal reasoning process of state-of-the-art MLLMs. Our dataset and code are open-sourced at https://github.com/luozhongze/Multi-Physics.
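As a rough illustration of the input-mode comparison described above, the sketch below computes final-answer accuracy per input mode and difficulty level, then takes the difference between the two modes. It is not taken from the Multi-Physics repository: the record fields (`input_mode`, `difficulty`, `predicted`, `answer`), the mode labels, and the toy records are all hypothetical, and answers are assumed to be comparable as plain strings.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Return {(input_mode, difficulty): accuracy} for a list of result records.

    Each record is a dict with hypothetical keys:
    "input_mode", "difficulty", "predicted", "answer".
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        key = (r["input_mode"], r["difficulty"])
        total[key] += 1
        # Count an exact string match on the final answer as correct.
        correct[key] += int(r["predicted"] == r["answer"])
    return {key: correct[key] / total[key] for key in total}


def visual_gain(acc, difficulty, with_image="image+text", without_image="text-only"):
    """Accuracy difference between the two (assumed) input modes at one difficulty."""
    return acc[(with_image, difficulty)] - acc[(without_image, difficulty)]


# Toy records, invented purely for illustration (not benchmark data).
records = [
    {"input_mode": "image+text", "difficulty": 1, "predicted": "A", "answer": "A"},
    {"input_mode": "image+text", "difficulty": 1, "predicted": "B", "answer": "B"},
    {"input_mode": "text-only", "difficulty": 1, "predicted": "A", "answer": "A"},
    {"input_mode": "text-only", "difficulty": 1, "predicted": "C", "answer": "B"},
]

acc = accuracy_by_group(records)
gain = visual_gain(acc, difficulty=1)
```

Grouping by `(input_mode, difficulty)` keeps the two factors separable, so the same table can be read along either axis. Scoring the chain-of-thought itself would need a separate step-level judge, which this sketch does not attempt.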
Z.-Q. Luo
Yin Zhou
Yong‐Xin Guo
Luo et al. (Fri,) studied this question.
www.synapsesocial.com/papers/68de6f4283cbc991d0a22ec8 — DOI: https://doi.org/10.48550/arxiv.2509.15839