What question did this study set out to answer?

This research aims to improve real-time question answering performance using a multimodal approach.

May 9, 2026 · Open Access

Retrieval Augmented Generation Using Multimodal Large Language Models for Real-Time Knowledge-Grounded Question Answering

Key Points

Introduced the MultiRAG framework, which integrates multimodal large language models with a real-time QA system.
Used a dense bi-encoder backbone for retrieval and a vision-language model as the generative backbone.
Conducted experiments on four benchmark datasets, including Natural Questions and RKUB-2024.
Achieved 87.3% Exact Match on open-domain QA and a 91.4% answer faithfulness score.
Demonstrated a 6.7× reduction in hallucination rate compared to vanilla LLM baselines.
Reduced hallucination by 82% relative to standard LLM deployment and outperformed all retrieval-augmented baselines by 4.2–9.8 percentage points across evaluation metrics.

Abstract

The exponential growth of heterogeneous digital information across structured and unstructured repositories presents a critical challenge for large language models (LLMs): they cannot access and reason over dynamically evolving knowledge without costly retraining. This paper introduces a comprehensive Retrieval Augmented Generation (RAG) framework that integrates multimodal large language models (MLLMs) with real-time, knowledge-grounded question answering systems. The proposed architecture, MultiRAG, combines a dense bi-encoder retrieval backbone with a cross-modal fusion module capable of jointly indexing and retrieving text, images, tables, and structured data. Retrieved multimodal evidence is processed by a vision-language model (VLM) serving as the generative backbone, conditioned on retrieved context through a novel cross-attention grounding mechanism that attenuates hallucination by enforcing faithfulness constraints at the token level. Experiments on four benchmark datasets (Natural Questions, WebQA, MultiModalQA, and a custom real-time knowledge update benchmark, RKUB-2024) demonstrate that MultiRAG achieves 87.3% Exact Match on open-domain QA, a 91.4% answer faithfulness score, and a 6.7× reduction in hallucination rate compared to vanilla LLM baselines. The real-time knowledge ingestion pipeline averages 340 ms of latency per document, supporting continuous knowledge grounding without model fine-tuning. The system reduces hallucination by 82% over standard LLM deployment and outperforms all retrieval-augmented baselines by 4.2–9.8 percentage points across evaluation metrics.
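To make the retrieve-then-generate flow in the abstract more concrete, here is a minimal, text-only sketch of bi-encoder style retrieval followed by prompt construction. It is an illustration under stated assumptions, not the MultiRAG implementation: the class `ToyBiEncoderIndex`, the function `build_prompt`, and the bag-of-words "encoder" are all hypothetical stand-ins for the learned encoders, the cross-modal fusion module, and the VLM described in the paper. The only property it tries to preserve is that documents and queries are encoded independently and compared by vector similarity.

```python
import math
import re


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class ToyBiEncoderIndex:
    """Hypothetical stand-in for a dense bi-encoder index (not the paper's model)."""

    def __init__(self, docs):
        self.docs = list(docs)
        vocab = sorted({tok for doc in self.docs for tok in tokenize(doc)})
        self.vocab = {tok: i for i, tok in enumerate(vocab)}
        # Documents are encoded once, ahead of time, independently of any query.
        self.doc_vecs = [self.encode(doc) for doc in self.docs]

    def encode(self, text):
        """Assumed encoder: L2-normalised bag-of-words over the corpus vocabulary."""
        vec = [0.0] * len(self.vocab)
        for tok in tokenize(text):
            if tok in self.vocab:
                vec[self.vocab[tok]] += 1.0
        norm = math.sqrt(sum(v * v for v in vec))
        return [v / norm for v in vec] if norm else vec

    def retrieve(self, query, k=2):
        """Return the top-k (score, document) pairs by cosine similarity."""
        q = self.encode(query)
        scored = [
            (sum(a * b for a, b in zip(q, dv)), doc)
            for dv, doc in zip(self.doc_vecs, self.docs)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:k]


def build_prompt(question, passages):
    """Assemble a grounded prompt; a real system would pass this to the generator."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer using only the context."


index = ToyBiEncoderIndex([
    "Paris is the capital of France.",
    "The mitochondria is the powerhouse of the cell.",
    "Python is a programming language.",
])
question = "What is the capital of France?"
hits = index.retrieve(question, k=1)
prompt = build_prompt(question, [doc for _, doc in hits])
print(prompt)
```

In a real deployment the `encode` method would be replaced by trained query and document encoders, and newly ingested documents would be encoded and appended to the index without retraining the generator, which is the property the abstract attributes to its real-time ingestion pipeline.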


Cite this study

Dr. K. Sujatha studied this question.

www.synapsesocial.com/papers/69fecfcdb9154b0b82876cb2 — DOI: https://doi.org/10.5281/zenodo.20068395
