What question did this study set out to answer?

The study assesses how different multimodal large language models handle visual math problems, and how they refuse when the required image is absent.

April 7, 2026 | Open Access

Seeing Is Solving: MLLMs, Reasoning, and Refusal in Visual Math

Key Points

Evaluated six multimodal large language models (three reasoning, three non-reasoning) on 376 Illustrative Mathematics items.
Selected items labeled image-role Required: the figure carries task-critical information that cannot be recovered from the text alone.
Each model attempted every item three times with and without the figure, under a shared prompt and scoring protocol.
With images, the top reasoning models reached accuracy in the mid-50% range, while non-reasoning models scored in the mid-30s to low-40s.
Without images, models overwhelmingly refused rather than guessed, with only rare correct-by-chance answers.
Visual misreading was the dominant failure mode for non-reasoning models, while reasoning models more often gave correct answers with adequate explanations.
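
The protocol summarized above (repeated attempts per item, with and without the figure, scored as correct, incorrect, or refusal) could be tallied roughly as follows. This is a minimal sketch, not the authors' released code: the record layout, the condition and outcome labels, and the `outcome_rates` helper are all assumptions, and the data are toy values rather than results from the study.

```python
from collections import Counter, defaultdict

# Hypothetical attempt records: (model, item_id, condition, outcome).
# All values are illustrative toy data, not results from the study.
ATTEMPTS = [
    ("model_a", "item_1", "with_image", "correct"),
    ("model_a", "item_1", "with_image", "correct"),
    ("model_a", "item_1", "with_image", "incorrect"),
    ("model_a", "item_1", "no_image", "refusal"),
    ("model_a", "item_1", "no_image", "refusal"),
    ("model_a", "item_1", "no_image", "correct"),
]

OUTCOMES = ("correct", "incorrect", "refusal")  # assumed outcome labels


def outcome_rates(attempts):
    """Return {(model, condition): {outcome: fraction of attempts}}."""
    counts = defaultdict(Counter)
    for model, _item, condition, outcome in attempts:
        counts[(model, condition)][outcome] += 1
    rates = {}
    for key, counter in counts.items():
        total = sum(counter.values())
        rates[key] = {o: counter[o] / total for o in OUTCOMES}
    return rates


RATES = outcome_rates(ATTEMPTS)
```

Keeping refusals as their own outcome, rather than folding them into "incorrect", makes it possible to separate declining from guessing in the no-image condition.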

Abstract

Many middle-school math problems are image-dependent: the diagram or graph carries essential information. This matters for intelligent tutoring and accessibility, where systems must reason over figures and also decline responsibly when figures are missing. We evaluate six contemporary multimodal large language models (MLLMs), three reasoning models and three non-reasoning models, on 376 Illustrative Mathematics (IM) items labeled as image-role Required (the figure contains task-critical information not recoverable from text alone without added assumptions). Each model attempts every item three times with and without the figure under a shared prompt and scoring protocol. To reduce image-role label subjectivity, we classify items as not Required when they are solvable from text alone without additional assumptions. With images, the top-performing reasoning models achieve accuracy in the mid-50% range, while non-reasoning models fall in the mid-30s to low-40s. Without images, models overwhelmingly refuse rather than guess, with only rare correct-by-chance answers. Models show moderate agreement on which items are solvable, and we release two benchmark subsets of items solved consistently across models. A qualitative audit of 83 items shows that visual misreading is the dominant failure mode for non-reasoning models, while reasoning models more often produce correct answers accompanied by adequate explanations. These results suggest that tutoring systems should gate automated scoring and learner-model updates on visual-evidence availability and use scaffolds that require explicit visual-evidence binding before algebra. For accessibility, systems should treat no-image refusals as missing-context signals and elicit the figure or a structured description, enabling description-substitution experiments. We release code, prompts, and summary artifacts for replication. Code and data: https://osf.io/ct7bg/
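
The gating recommendation in the abstract could look something like the sketch below. It is an illustration under stated assumptions, not the authors' implementation: the `Attempt` fields, the `next_action` function, and the three action labels are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    """One tutoring interaction (hypothetical fields, for illustration only)."""
    image_required: bool   # item labeled image-role Required
    image_available: bool  # the figure (or a structured description) was supplied
    model_refused: bool    # the model declined to answer


def next_action(attempt: Attempt) -> str:
    """Decide what a tutoring system does with an attempt (assumed policy)."""
    # Missing visual evidence: do not score; ask for the figure or a description.
    if attempt.image_required and not attempt.image_available:
        return "request_figure"
    # Evidence was present but the model still refused: hold back automated updates.
    if attempt.model_refused:
        return "hold_for_review"
    # Evidence present and an answer given: safe to score and update the learner model.
    return "score_and_update"


ACTION = next_action(Attempt(image_required=True, image_available=False, model_refused=True))
```

Under this assumed policy, a no-image refusal is routed to a request for the figure instead of being scored as a wrong answer, which matches the missing-context reading described in the abstract.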

Authors

Ethan Croteau

Neil T. Heffernan

Institutions

Worcester Polytechnic Institute
