What question did this study set out to answer?

The aim is to review and synthesize advancements in multimodal text analytics, focusing on diverse tasks and methodologies. The paper identifies gaps in current research.

April 10, 2026

Decoding Multimodal Text Analytics: Tasks, Datasets, Fusion Models, and Future Frontiers

Key Points

Systematic analysis of over 160 research studies
Categorization of more than 120 state-of-the-art models
Review of ten core text analytics tasks
Performance comparison using various datasets
Exploration of under-explored tasks like personality detection
Multimodal models show 18%–25% F1-score improvements over text-only baselines
Identification of gaps in modality fusion and dataset diversity
Core tasks in multimodal text analytics are unified into a clear overview

Abstract

The volume of digital data is estimated to grow exponentially, reaching 180 zettabytes by 2025, with more than 90% of it unstructured. This growth has triggered a shift from unimodal to multimodal text analytics (MTA). Multimodal text was first observed in scholarly literature and industrial use cases during the early 2010s. Since then, MTA has expanded into sectors such as healthcare, e-commerce, education, and public safety. This survey presents a task-oriented, modality-inclusive, and dataset-aware synthesis of recent advancements in MTA, offering an in-depth review of 10 core text analytics tasks through a multimodal lens. We systematically analyze over 160 research studies and categorize more than 120 state-of-the-art models, spanning fusion strategies, representation learning, transformer architectures, and pretrained vision-language frameworks (e.g., CLIP, ViLBERT). Across a variety of datasets, including CMU-MOSI, CMU-MOSEI, IEMOCAP, and MAViT-Bangla, multimodal models achieve up to 18%–25% F1-score improvements over text-only baselines, as captured in the standardized task-wise comparison tables that are part of this survey. Moreover, this survey discusses seven under-explored tasks, including personality detection, satire detection, and author profiling, and elaborates on research gaps in modality fusion, dataset diversity, and social inclusivity in these tasks. It not only fills gaps in the current literature by unifying knowledge across different fields, but also offers researchers working on MTA a path forward. It is the first survey to bring all the key tasks within multimodal text analytics into a contiguous and consistent overview, in contrast to other surveys that either treat multimodal computing at an administrative level or concentrate on a specific task.
This article is categorized under: Algorithmic Development > Text Mining; Algorithmic Development > Web Mining; Application Areas > Society and Culture

Authors

Tanusree Nath

Vedika Gupta

Manjari Gupta

Journals

Wiley Interdisciplinary Reviews Data Mining and Knowledge Discovery

Institutions

University of Tartu

Banaras Hindu University

Punjabi University