With the exponential growth of multimodal data, the limitations of traditional unimodal models in cross-modal understanding and complex scenario reasoning have become increasingly evident. Built on Large Language Models (LLMs), Multimodal Large Language Models (MLLMs) retain strong reasoning abilities and demonstrate unique capabilities in multimodal understanding. This survey reviews the current research landscape of MLLMs. It systematically analyzes mainstream model architectures, training and fine-tuning strategies, and task classifications, and gives a structured account of evaluation methodologies. Beyond synthesis, the paper highlights emerging trends that aim for balanced integration across modalities, tasks, and components, and critically examines key challenges together with potential solutions. The survey places particular emphasis on recent reasoning-oriented MLLMs, with a focus on DeepSeek-R1, analyzing their design paradigms and contributions from the perspective of symmetric reasoning capabilities. Overall, this work summarizes cutting-edge advancements and lays a foundation for the future development of MLLMs, especially those guided by symmetry principles.
Xinran Liu
Haojie Liu
Symmetry
Zhejiang University
Sichuan University
Liu et al. studied this question.
www.synapsesocial.com/papers/68bb3efd2b87ece8dc957dc9 — DOI: https://doi.org/10.3390/sym17091400