August 26, 2024

Open Access

MMR: Evaluating Reading Ability of Large Multimodal Models

Abstract

Large multimodal models (LMMs) have demonstrated impressive capabilities in understanding various types of images, including text-rich images. Most existing text-rich image benchmarks consist of simple extraction-based question answering, and many LMMs now easily achieve high scores on them. As a result, current benchmarks fail to accurately reflect the performance of different models, which motivates a new benchmark that evaluates their complex reasoning and spatial understanding abilities. In this work, we propose the Multi-Modal Reading (MMR) benchmark, comprising 11 diverse tasks, to evaluate LMMs on text-rich image understanding. MMR is the first text-rich image benchmark built on human annotations with the help of language models. Evaluating several state-of-the-art LMMs, including GPT-4o, reveals the limited capabilities of existing LMMs, underscoring the value of our benchmark.

Cite this study

Chen et al. (2024) studied this question.

www.synapsesocial.com/papers/68e5b010b6db64358754933e — DOI: https://doi.org/10.48550/arxiv.2408.14594

Authors

Jian Chen

Ruiyi Zhang

Yufan Zhou
