March 17, 2026

Open Access

GPT-MM: Improving Multimodal In-context Learning with Task-specific Retrieval and Reasoning

Key Points

Aims to improve multimodal in-context learning (ICL) by integrating task-specific retrieval and reasoning mechanisms.
Proposes a unified ICL framework that combines task-aware demonstration retrieval with label-induced reasoning.
Validates the framework first on textual relation extraction (RE).
Extends the framework to visual question answering (VQA) and audio question answering (AudioQA).
Substantially narrows the performance gap between ICL and fully supervised models.
Consistently outperforms GPT-3 and GPT-4 baselines across textual and multimodal benchmarks.
Achieves competitive or superior results compared with fine-tuned models.

Abstract

Large language models (LLMs) have exhibited impressive generalization through in-context learning (ICL), yet most studies focus on textual tasks, leaving the mechanisms that enable ICL to generalize across modalities largely unexplored. To bridge this gap, we propose a unified ICL framework that integrates task-aware demonstration retrieval and label-induced reasoning as two complementary components for improving both accuracy and interpretability. We first validate the framework in textual relation extraction (RE), a representative structured prediction task that challenges LLMs to infer fine-grained entity–relation semantics. Task-aware retrieval ensures that retrieved examples are semantically aligned with the target instance, while label-induced reasoning enriches each demonstration with label-grounded explanatory logic. These mechanisms substantially narrow the performance gap between ICL and fully supervised models. We then extend this framework to multimodal ICL, leveraging GPT-4o for visual question answering (VQA) and Whisper-large-v3 for audio question answering (AudioQA). Across both textual and multimodal benchmarks, our framework consistently outperforms GPT-3 and GPT-4 baselines and achieves competitive or superior results compared with fine-tuned models. These findings demonstrate that task-aware retrieval and label-induced reasoning together form a generalizable foundation for a unified in-context learning paradigm across modalities.
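
The two components named in the abstract can be illustrated with a minimal sketch for the relation extraction setting. This is not the authors' implementation: the bag-of-words `embed` function is a toy stand-in for a task-aware encoder, the demonstration pool and its `reasoning` strings are hand-written placeholders, and every function and field name here is hypothetical. The sketch only shows the general shape: rank demonstrations by similarity to the target instance, then build a prompt in which each demonstration carries an explanation grounded in its gold label.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words vector (assumption: stands in for a task-aware encoder)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)  # Counter yields 0 for missing tokens
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, pool, k=2):
    """Return the k demonstrations most similar to the query."""
    q = embed(query)
    ranked = sorted(pool, key=lambda d: cosine(q, embed(d["text"])), reverse=True)
    return ranked[:k]

def build_prompt(query, demos):
    """Each demonstration shows its context, a label-grounded explanation, and the label."""
    blocks = [
        f"Context: {d['text']}\nReasoning: {d['reasoning']}\nRelation: {d['label']}"
        for d in demos
    ]
    # The target instance is left open so the model continues with reasoning, then a label.
    blocks.append(f"Context: {query}\nReasoning:")
    return "\n\n".join(blocks)

# Hypothetical demonstration pool; explanations are illustrative placeholders.
POOL = [
    {"text": "Marie Curie was born in Warsaw.",
     "label": "place_of_birth",
     "reasoning": "'was born in' links a person to a city, so the relation is place_of_birth."},
    {"text": "Tim Cook is the chief executive of Apple.",
     "label": "top_executive_of",
     "reasoning": "'chief executive of' links a person to a company they lead."},
    {"text": "The Seine flows through Paris.",
     "label": "located_in",
     "reasoning": "'flows through' places the river inside the city."},
]

query = "Alan Turing was born in London."
demos = retrieve(query, POOL, k=2)
prompt = build_prompt(query, demos)
print(prompt)
```

In a real system the resulting prompt would be sent to the language model, and the similarity function would be replaced by whatever task-aware representation the framework uses; the abstract does not specify either at this level of detail.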

Authors

Zhen Wan

Fei Cheng

Sadao Kurohashi

Journals

Journal of Natural Language Processing

Institutions

Kyoto University

National Institute of Informatics
