Many middle-school math problems are image-dependent: the diagram or graph carries essential information. This matters for intelligent tutoring and accessibility, where systems must reason over figures and also decline responsibly when figures are missing. We evaluate six contemporary multimodal large language models (MLLMs), three reasoning and three non-reasoning, on 376 Illustrative Mathematics (IM) items labeled as image-role Required (the figure contains task-critical information not recoverable from text alone without added assumptions). Each model attempts every item three times with and without the figure under a shared prompt and scoring protocol. To reduce image-role label subjectivity, we classify items as not Required when they are solvable from text alone without additional assumptions. With images, the top-performing reasoning models achieve accuracy in the mid-50% range, while non-reasoning models fall in the mid-30% to low-40% range. Without images, models overwhelmingly refuse rather than guess, with only rare correct-by-chance answers. Models show moderate agreement on which items are solvable, and we release two benchmark subsets of items solved consistently across models. A qualitative audit of 83 items shows that visual misreading is the dominant failure mode for non-reasoning models, while reasoning models more often produce correct answers accompanied by adequate explanations. These results suggest that tutoring systems should gate automated scoring and learner-model updates on visual-evidence availability and use scaffolds that require explicit visual-evidence binding before algebra. For accessibility, systems should treat no-image refusals as missing-context signals and elicit the figure or a structured description, enabling description-substitution experiments. We release code, prompts, and summary artifacts for replication. Code and data: https://osf.io/ct7bg/
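The protocol described above (repeated attempts per item, with and without the figure, scored as correct, incorrect, or refused) can be illustrated with a small aggregation sketch. This is not the authors' released code: the `Attempt` record, the outcome labels, and the rule used for "solved consistently" (every listed model correct on all of its with-image attempts) are assumptions made for illustration only.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    # Hypothetical record layout; field names are illustrative assumptions.
    model: str
    item_id: str
    with_image: bool
    outcome: str  # one of "correct", "incorrect", "refused"


def summarize(attempts):
    """Accuracy and refusal rate per (model, with_image) condition."""
    counts = defaultdict(lambda: {"n": 0, "correct": 0, "refused": 0})
    for a in attempts:
        c = counts[(a.model, a.with_image)]
        c["n"] += 1
        c["correct"] += a.outcome == "correct"
        c["refused"] += a.outcome == "refused"
    return {
        key: {
            "accuracy": c["correct"] / c["n"],
            "refusal_rate": c["refused"] / c["n"],
        }
        for key, c in counts.items()
    }


def consistently_solved(attempts, models):
    """Items on which every listed model was correct on all with-image attempts.

    Assumed definition of "consistent"; the paper may use a different rule.
    """
    per_pair = defaultdict(list)
    for a in attempts:
        if a.with_image:
            per_pair[(a.model, a.item_id)].append(a.outcome == "correct")
    items = {item_id for (_, item_id) in per_pair}
    return {
        item_id
        for item_id in items
        if all(
            per_pair.get((m, item_id)) and all(per_pair[(m, item_id)])
            for m in models
        )
    }
```

Under these assumptions, a high `refusal_rate` in the no-image condition combined with near-zero `accuracy` would correspond to the "refuse rather than guess" behavior reported above.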
Ethan Croteau
Neil T. Heffernan
Worcester Polytechnic Institute
Croteau et al. studied this question.
www.synapsesocial.com/papers/69d49fa9b33cc4c35a2280db — DOI: https://doi.org/10.5281/zenodo.19420819