Aspect-Based Sentiment Analysis (ABSA) identifies the sentiment expressed toward specific aspects of an opinion and holds significant commercial potential for monitoring brand reputation, understanding customer satisfaction, and personalizing recommendations. However, traditional methods rely exclusively on textual input and often struggle when the target aspect is not explicitly mentioned in the sentence. Multimodal Aspect-Based Sentiment Analysis (MABSA) addresses this limitation by combining textual and visual modalities for more comprehensive sentiment understanding. Despite advances in deep learning and transformer-based architectures, existing models often suffer from suboptimal modality fusion and weak aspect grounding, which limits their classification accuracy. To overcome these challenges, we propose an Attention-Driven Feature Fusion (ADFF) approach based on a three-stage hierarchical attention mechanism. First, it fuses only the text and image embeddings. Second, it incorporates aspect-level features into the fused representation. Third, a multi-head attention layer further strengthens cross-modal dependencies. The resulting representation is passed to a Long Short-Term Memory (LSTM) classifier for sentiment polarity prediction. We evaluate our model on three benchmark datasets: Twitter-2015, Twitter-2017, and MASAD. The experimental results show that the proposed model substantially outperforms state-of-the-art multimodal and unimodal baselines in both accuracy and F1-score. It achieves 82.55% accuracy and 81.05% F1-score on Twitter-2015, 77.07% accuracy and 77.15% F1-score on Twitter-2017, and up to 99.67% accuracy and F1-score in the Plant domain of MASAD, with consistent improvements across all seven MASAD domains. These results highlight the effectiveness and scalability of the hierarchical attention-based fusion strategy for real-world aspect-based sentiment analysis tasks.
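The three fusion stages described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the attention form, so scaled dot-product attention is assumed, and learned projection matrices are omitted for brevity. The residual connections and the function names (`attention`, `multi_head_self_attention`, `hierarchical_fusion`) are hypothetical. The LSTM classifier that would consume the fused sequence is left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(query, key, value):
    # Scaled dot-product attention: query (n, d), key/value (m, d) -> (n, d).
    scores = query @ key.T / np.sqrt(query.shape[-1])
    return softmax(scores, axis=-1) @ value

def multi_head_self_attention(x, num_heads):
    # Parameter-free multi-head self-attention (assumption: no learned
    # projections): split the feature axis into heads, attend, concatenate.
    assert x.shape[-1] % num_heads == 0, "feature size must divide into heads"
    heads = np.split(x, num_heads, axis=-1)
    return np.concatenate([attention(h, h, h) for h in heads], axis=-1)

def hierarchical_fusion(text, image, aspect, num_heads=4):
    # Stage 1: text tokens attend over image regions (text-image fusion only).
    text_image = text + attention(text, image, image)
    # Stage 2: the fused sequence attends over aspect-level features.
    with_aspect = text_image + attention(text_image, aspect, aspect)
    # Stage 3: multi-head attention refines dependencies in the fused sequence.
    return with_aspect + multi_head_self_attention(with_aspect, num_heads)

# Toy inputs with a shared embedding size of 64 (sizes are illustrative).
rng = np.random.default_rng(0)
text = rng.normal(size=(12, 64))    # 12 text tokens
image = rng.normal(size=(49, 64))   # 49 image regions
aspect = rng.normal(size=(3, 64))   # 3 aspect tokens

fused = hierarchical_fusion(text, image, aspect)
print(fused.shape)  # (12, 64)
```

In a full model, the `fused` sequence would be fed to a recurrent classifier such as an LSTM, and each attention step would use trainable query, key, and value projections.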
Ifakir et al. studied this question.