What question did this study set out to answer?

This research aims to evaluate the effectiveness of multimodal large language models in solving visually presented mathematical problems across multiple languages.

February 14, 2026 · Open Access

Evaluating visual mathematics in multimodal LLMs: a multilingual benchmark based on the Kangaroo tests

Key Points

Assessment of various MLLMs including GPT-4o, Pixtral, Qwen-VL, Llama 3.2 Vision, and Gemini 2.0 Flash.
Utilization of a multilingual Kangaroo-style benchmark in English, French, Spanish, and Catalan.
Analysis of models' performance in categories like geometry, visual algebra, and combinatorics.
Overall accuracy remains moderate across diverse mathematical topics with no model excelling in all areas.
Most models improve only slightly on questions without images, and some perform almost the same with or without visual input, indicating underutilisation of diagrams.
Performance varies substantially across languages and difficulty levels: models handle easier items but struggle with advanced geometry and combinatorial reasoning.
Gemini 2.0 Flash achieves the highest accuracy on image-based tasks, though no model approaches human-level performance.

Abstract

Multimodal Large Language Models (MLLMs) promise advanced vision-language capabilities, yet their effectiveness in visually presented mathematics remains underexplored. This paper analyses the development and evaluation of MLLMs for mathematical problem-solving, focusing on diagrams, multilingual text, and symbolic notation. The computational demands of evaluating these large-scale models across multilingual datasets necessitate high-performance computing infrastructure, as systematic benchmarking of state-of-the-art MLLMs requires distributed processing of thousands of inference requests and parallel evaluation across multiple model architectures. We then assess several models, including GPT-4o, Pixtral, Qwen-VL, Llama 3.2 Vision variants, and Gemini 2.0 Flash, in a multilingual Kangaroo-style benchmark spanning English, French, Spanish, and Catalan. Our experiments reveal four key findings. First, overall accuracy remains moderate across geometry, visual algebra, logic, patterns, and combinatorics: no single model excels in every topic. Second, whilst most models see improved accuracy with questions that do not have images, the gain is often limited; performance for some remains nearly unchanged without visual input, indicating underutilisation of diagrammatic information. Third, substantial variation exists across languages and difficulty levels: models frequently handle easier items but struggle with advanced geometry and combinatorial reasoning. Notably, Gemini 2.0 Flash achieves the highest accuracy on image-based tasks, followed by Qwen-VL 2.5 72B and GPT-4o, though none approach human-level performance. Fourth, a complementary analysis aimed at distinguishing whether models reason or simply recite reveals that Gemini and GPT-4o stand out for their structured reasoning and consistent accuracy. In contrast, Pixtral and Llama exhibit less consistent reasoning, often defaulting to heuristics or randomness when unable to align their outputs with the given answer options. Furthermore, detailed error analysis identifies two primary failure modes: encoding-stage errors, where models misidentify visual elements such as colours or shapes, and visio-semantic processing errors, where models struggle with three-dimensional spatial reasoning and geometric relationships, revealing systematic limitations even in state-of-the-art architectures.
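The comparisons in the abstract (accuracy by topic, and accuracy with versus without images) can be illustrated with a minimal sketch. This is not the authors' evaluation code: the record fields (`topic`, `has_image`, `answer`, `predicted`) and the toy data are hypothetical, chosen only to show how such per-group accuracies might be tallied for multiple-choice items.

```python
from collections import defaultdict


def accuracy_by(records, key):
    """Return {group: fraction of correct answers} grouped by `key`."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in records:
        group = r[key]
        totals[group] += 1
        if r["predicted"] == r["answer"]:
            correct[group] += 1
    return {g: correct[g] / totals[g] for g in totals}


# Hypothetical toy records; real benchmark items would carry more metadata
# (language, difficulty level, model name, and so on).
records = [
    {"topic": "geometry", "has_image": True, "answer": "B", "predicted": "B"},
    {"topic": "geometry", "has_image": True, "answer": "C", "predicted": "A"},
    {"topic": "combinatorics", "has_image": False, "answer": "D", "predicted": "D"},
    {"topic": "logic", "has_image": False, "answer": "A", "predicted": "A"},
]

by_topic = accuracy_by(records, "topic")
by_image = accuracy_by(records, "has_image")

# Difference between accuracy on text-only and image-based questions.
image_gap = by_image[False] - by_image[True]
```

The same helper could be grouped by a language or difficulty field to reproduce the other breakdowns the abstract mentions, assuming those fields exist in the data.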

Cite this study

Igualde-Sáez et al. (2026) studied this question.

www.synapsesocial.com/papers/699011522ccff479cfe57d2b — DOI: https://doi.org/10.1007/s11227-026-08291-1

Also consider

Synapse has enriched 5 closely related papers on similar questions. Consider them for comparative context:

The Revolution of Multimodal Large Language Models: A Survey · 2024 · 63 citations
LLM performance on mathematical reasoning in Catalan language · 2025 · 3 citations
VQA: Visual Question Answering · 2015 · 4,260 citations
Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models · 2024 · 22 citations
Solving Geometry Problems: Combining Text and Diagram Interpretation · 2015 · 153 citations

Authors

Arnau Igualde-Sáez

Lamyae Rhomrasi

Yusef Ahsini

Journals

The Journal of Supercomputing

Institutions

University of Michigan

Universitat Politècnica de València
