What type of study is this?

This is a quantitative study.

October 20, 2025 · Open Access

Q2E: Query-to-Event Decomposition for Zero-Shot Multilingual Text-to-Video Retrieval

Key Points

Q2E effectively enhances text-to-video retrieval by decomposing queries with knowledge from large-language models.
Our evaluations on two diverse datasets demonstrate improved retrieval performance, outperforming state-of-the-art methods.
The entropy-based fusion scoring method improves the integration of diverse audio and visual knowledge for effective matching.
Integrating audio information significantly boosts the performance of zero-shot text-to-video retrieval systems.

Abstract

Recent approaches have shown impressive proficiency in extracting and leveraging parametric knowledge from Large-Language Models (LLMs) and Vision-Language Models (VLMs). In this work, we consider how we can improve the identification and retrieval of videos related to complex real-world events by automatically extracting latent parametric knowledge about those events. We present Q2E: a Query-to-Event decomposition method for zero-shot multilingual text-to-video retrieval, adaptable across datasets, domains, LLMs, or VLMs. Our approach demonstrates that we can enhance the understanding of otherwise overly simplified human queries by decomposing the query using the knowledge embedded in LLMs and VLMs. We additionally show how to apply our approach to both visual and speech-based inputs. To combine this varied multimodal knowledge, we adopt entropy-based fusion scoring for zero-shot fusion. Through evaluations on two diverse datasets and multiple retrieval metrics, we demonstrate that Q2E outperforms several state-of-the-art baselines. Our evaluation also shows that integrating audio information can significantly improve text-to-video retrieval. We have released code and data for future research.
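The abstract names entropy-based fusion scoring but does not give its formula. The sketch below is one plausible reading, not the authors' implementation: each modality's similarity scores are turned into a softmax distribution, and modalities whose distribution has lower entropy (a more decisive ranking) receive more weight. The function names, the softmax step, and the `1 - normalized entropy` weighting are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Turn raw similarity scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    """Shannon entropy scaled to [0, 1] by dividing by log(n)."""
    n = len(probs)
    if n < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(n)


def entropy_weighted_fusion(scores_by_modality):
    """Fuse per-modality scores over the same candidate videos.

    Assumed weighting (hypothetical, not taken from the paper):
    confidence = 1 - normalized entropy, renormalized across modalities.
    Returns (fused_scores, weights).
    """
    confidences = {}
    for name, scores in scores_by_modality.items():
        h = normalized_entropy(softmax(scores))
        confidences[name] = max(0.0, 1.0 - h)  # clamp float round-off

    total = sum(confidences.values())
    if total <= 1e-12:
        # No modality is decisive, so fall back to uniform weights.
        weights = {name: 1.0 / len(confidences) for name in confidences}
    else:
        weights = {name: c / total for name, c in confidences.items()}

    n_items = len(next(iter(scores_by_modality.values())))
    fused = [
        sum(weights[name] * scores[i] for name, scores in scores_by_modality.items())
        for i in range(n_items)
    ]
    return fused, weights


# Usage: the visual scores single out the first video, the audio scores do not.
fused, weights = entropy_weighted_fusion(
    {"visual": [5.0, 0.0, 0.0], "audio": [1.0, 1.0, 1.0]}
)
```

In this toy input the audio scores are uniform, so nearly all of the weight goes to the visual modality. When every modality is equally indecisive, the sketch falls back to a plain average.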

Cite this study

Dipta et al. studied this question.

www.synapsesocial.com/papers/68f6196ee0bbbc94fac36217 — DOI: https://doi.org/10.48550/arxiv.2506.10202

Also consider

Synapse has enriched 5 closely related papers on similar research questions. Consider them for comparative context:

Bridging Information Asymmetry in Text-video Retrieval: A Data-centric Approach · 2024
Localizing Events in Videos with Multimodal Queries · 2024
Multimodal LLM-based Query Paraphrasing for Video Search · 2024 · 1 citation
Retrieval Augmented Zero-Shot Text Classification · 2024 · 9 citations
Q-Frame: Query-Aware Frame Selection and Multi-Resolution Adaptation for Video-LLMs · 2025

Authors

Shubhashis Roy Dipta

Francis Ferraro
