May 26, 2026 | Open Access

An Attention-Driven Feature Fusion Approach for Multimodal Aspect-Based Sentiment Analysis

Key Points

The study aims to enhance aspect-based sentiment analysis by integrating both textual and visual data to improve sentiment classification accuracy.
Proposed an Attention-Driven Feature Fusion approach utilizing a three-stage hierarchical attention mechanism.
Fused text and image embeddings followed by aspect-level feature integration and multi-head attention.
Evaluated model performance on three benchmark datasets: Twitter-2015, Twitter-2017, and MASAD.
Achieved 82.55% accuracy and 81.05% F1-score on Twitter-2015.
Reported 77.07% accuracy and 77.15% F1-score on Twitter-2017.
Attained up to 99.67% accuracy and F1-score in the Plant domain of MASAD.
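
As a reading aid for the accuracy and F1 figures above, the following is a minimal sketch of how such scores can be computed for three-way sentiment labels. The summary does not state which F1 averaging the authors used, so macro averaging is an assumption here, and the labels and predictions are made-up toy data, not results from the paper.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro averaging is assumed)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data for illustration only.
y_true = ["pos", "neg", "neu", "pos", "neg", "neu"]
y_pred = ["pos", "neg", "pos", "pos", "neu", "neu"]

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.4f} macro_f1={f1:.4f}")  # accuracy=0.6667 macro_f1=0.6556
```

Accuracy and F1 can differ noticeably when classes are imbalanced, which is why the two are reported side by side.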

Abstract

Aspect-Based Sentiment Analysis examines the sentiment expressed toward specific aspects of an opinion and holds significant commercial potential for monitoring brand reputation, understanding customer satisfaction, and personalizing recommendations. However, traditional methods rely exclusively on textual input and often struggle when the target aspect is not mentioned in the sentence. Multimodal Aspect-Based Sentiment Analysis addresses this limitation by incorporating both textual and visual modalities to enable more comprehensive sentiment understanding. Despite advances in deep learning and transformer-based architectures, existing models often suffer from suboptimal modality fusion and weak aspect grounding, which limit their classification accuracy. To overcome these challenges, we propose an Attention-Driven Feature Fusion (ADFF) approach based on a three-stage hierarchical attention mechanism. First, it fuses only the text and image embeddings. Second, it incorporates aspect-level features. Third, a multi-head attention layer further strengthens cross-modal dependencies. The resulting representation is passed to a Long Short-Term Memory (LSTM) classifier for sentiment polarity prediction. We evaluate our model on three benchmark datasets: Twitter-2015, Twitter-2017, and MASAD. The experimental results show that the proposed model substantially outperforms state-of-the-art multimodal and unimodal baselines in both accuracy and F1-score. It achieves 82.55% accuracy and 81.05% F1-score on Twitter-2015, 77.07% accuracy and 77.15% F1-score on Twitter-2017, and up to 99.67% accuracy and F1-score in the Plant domain of MASAD, with consistent improvements across all seven MASAD domains. These results highlight the effectiveness and scalability of the hierarchical attention-based fusion strategy for real-world aspect-based sentiment analysis tasks.
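
The three fusion stages described in the abstract can be illustrated with a rough, dependency-free sketch. This is not the authors' implementation: the function names (`attend`, `multi_head_attend`, `adff_sketch`), the residual additions, the absence of learned projection weights, and the choice of which side acts as the attention query at each stage are all assumptions made for illustration. The LSTM classifier is omitted; the returned sequence is what would be passed to it.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over lists of vectors (no learned weights)."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def multi_head_attend(seq, num_heads):
    """Self-attention run independently on equal slices of the feature dimension."""
    dim = len(seq[0])
    assert dim % num_heads == 0, "feature size must be divisible by num_heads"
    head_dim = dim // num_heads
    heads = []
    for h in range(num_heads):
        part = [vec[h * head_dim:(h + 1) * head_dim] for vec in seq]
        heads.append(attend(part, part, part))
    # Concatenate the head outputs for each position.
    return [sum((heads[h][i] for h in range(num_heads)), [])
            for i in range(len(seq))]


def add(seq_a, seq_b):
    """Element-wise sum of two equally shaped sequences (a residual connection)."""
    return [[a + b for a, b in zip(va, vb)] for va, vb in zip(seq_a, seq_b)]


def adff_sketch(text, image, aspect, num_heads=2):
    # Stage 1 (assumed form): text tokens attend over image regions only.
    fused = add(text, attend(text, image, image))
    # Stage 2 (assumed form): fused tokens attend over aspect tokens.
    aspect_aware = add(fused, attend(fused, aspect, aspect))
    # Stage 3: multi-head self-attention over the aspect-aware sequence.
    return multi_head_attend(aspect_aware, num_heads)


# Random stand-ins for encoder outputs: 5 text tokens, 3 image regions, 2 aspect tokens.
random.seed(0)
dim = 4
text = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(5)]
image = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(3)]
aspect = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(2)]

sequence = adff_sketch(text, image, aspect)
print(len(sequence), len(sequence[0]))  # 5 4
```

Keeping the output one vector per text token preserves a sequence that a recurrent classifier such as an LSTM can consume. A real model would add learned query, key, and value projections at every stage.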
