Large language models (LLMs) have exhibited impressive generalization through in-context learning (ICL), yet most studies focus on textual tasks, leaving the mechanisms that enable ICL to generalize across modalities largely unexplored. To bridge this gap, we propose a unified ICL framework that integrates task-aware demonstration retrieval and label-induced reasoning as two complementary components for improving both accuracy and interpretability. We first validate the framework in textual relation extraction (RE), a representative structured prediction task that challenges LLMs to infer fine-grained entity–relation semantics. Task-aware retrieval ensures that retrieved examples are semantically aligned with the target instance, while label-induced reasoning enriches each demonstration with label-grounded explanatory logic. These mechanisms substantially narrow the performance gap between ICL and fully supervised models. We then extend this framework to multimodal ICL, leveraging GPT-4o for visual question answering (VQA) and Whisper-large-v3 for audio question answering (AudioQA). Across both textual and multimodal benchmarks, our framework consistently outperforms GPT-3 and GPT-4 baselines and achieves competitive or superior results compared with fine-tuned models. These findings demonstrate that task-aware retrieval and label-induced reasoning together form a generalizable foundation for a unified in-context learning paradigm across modalities.
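As a rough illustration of how the two components described above could fit together for relation extraction, here is a minimal Python sketch. It is not the authors' implementation. The bag-of-words cosine similarity is a toy stand-in for a task-aware encoder, the `reasoning` strings are hand-written placeholders for label-induced explanations, and the names `POOL`, `retrieve`, and `build_prompt` are hypothetical.

```python
import math
from collections import Counter


def bow(text):
    """Lowercased bag-of-words counts (toy stand-in for a task-aware encoder)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, pool, k=2):
    """Return the k demonstrations most similar to the query."""
    q = bow(query)
    ranked = sorted(pool, key=lambda d: cosine(q, bow(d["text"])), reverse=True)
    return ranked[:k]


def build_prompt(query, demos):
    """Format demonstrations with reasoning and label, then append the query."""
    parts = [
        f"Sentence: {d['text']}\nReasoning: {d['reasoning']}\nRelation: {d['label']}"
        for d in demos
    ]
    parts.append(f"Sentence: {query}\nReasoning:")
    return "\n\n".join(parts)


# Hypothetical labeled pool; the reasoning text is an assumption standing in
# for explanations that would be induced from the gold labels.
POOL = [
    {"text": "Marie Curie was born in Warsaw.",
     "label": "place_of_birth",
     "reasoning": "'was born in' links the person to a location."},
    {"text": "Tim Cook is the CEO of Apple.",
     "label": "employer",
     "reasoning": "'is the CEO of' links the person to an organization."},
    {"text": "Ada Lovelace was born in London.",
     "label": "place_of_birth",
     "reasoning": "'was born in' links the person to a location."},
]

query = "Alan Turing was born in London."
demos = retrieve(query, POOL, k=2)
prompt = build_prompt(query, demos)
print(prompt)
```

The resulting prompt would then be sent to an LLM, which is expected to continue with its own reasoning and a relation label for the query sentence.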
Zhen Wan
Fei Cheng
Sadao Kurohashi
Journal of Natural Language Processing
Kyoto University
National Institute of Informatics
Wan et al. studied this question.
www.synapsesocial.com/papers/69b8ef6ddeb47d591b8c5825 — DOI: https://doi.org/10.5715/jnlp.33.207