The paper proposes a solution for generating image captions in both formal and conversational registers. This study is motivated by the need for educational tools to assist non-native speakers in mastering colloquial Russian. The methodology employs a multimodal encoder–decoder ensemble architecture, in which a pre-trained ResNet-152 Convolutional Neural Network serves as the encoder, and an LSTM network functions as the decoder. The captioning performance is further enhanced by incorporating the Bahdanau attention mechanism. To facilitate training, the authors constructed a proprietary dataset derived from MS COCO, which was translated and stylistically adapted via the GigaChat large language model. During ensemble construction, ruCLIPScore is utilized to select the most effective model configurations. Experimental results indicate that the ensemble significantly outperforms its individual constituent models according to ruCLIPScore and can produce captions with stylistic diversity across registers.
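As a rough illustration of the Bahdanau (additive) attention step mentioned above, the following minimal sketch computes attention weights over a set of encoder feature vectors and the resulting context vector for one decoder step. It is not the authors' implementation: the function names, the toy dimensions, and the weight matrices `W_s`, `W_h`, and `v` are all assumptions chosen for illustration, and a real system would use learned parameters in a deep learning framework with ResNet-152 region features and the LSTM hidden state as inputs.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def additive_attention(decoder_state, encoder_features, W_s, W_h, v):
    """Bahdanau-style additive attention for a single decoder step.

    score_i = v . tanh(W_s @ decoder_state + W_h @ feature_i)
    weights = softmax(scores)
    context = sum_i weights_i * feature_i

    All parameters here are hypothetical toy values, not learned weights.
    """
    projected_state = matvec(W_s, decoder_state)
    scores = []
    for feature in encoder_features:
        projected_feature = matvec(W_h, feature)
        hidden = [math.tanh(a + b)
                  for a, b in zip(projected_state, projected_feature)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))

    # Numerically stable softmax over the image regions.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Context vector: attention-weighted sum of the encoder features.
    dim = len(encoder_features[0])
    context = [sum(w * f[d] for w, f in zip(weights, encoder_features))
               for d in range(dim)]
    return weights, context


if __name__ == "__main__":
    # Toy example: 2 image regions with 3-dim features, 2-dim decoder state.
    state = [0.5, -0.5]
    features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    W_s = [[0.1, 0.2], [0.3, -0.1]]            # attention_dim x state_dim
    W_h = [[0.5, -0.5, 0.0], [0.2, 0.4, 0.1]]  # attention_dim x feature_dim
    v = [1.0, -1.0]                            # attention_dim
    weights, context = additive_attention(state, features, W_s, W_h, v)
    print(weights, context)
```

At each decoding step the context vector would be combined with the previous word embedding as input to the LSTM, so the decoder can focus on different image regions for different words.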
M. A. Privalov
A. S. Kozharinov
Doklady Mathematics
National University of Science and Technology
Privalov et al. (Mon,) studied this question.
synapsesocial.com/papers/69c2295caeb5a845df0d3bbc — DOI: https://doi.org/10.1134/s1064562425700528