What question did this study set out to answer?

The aim is to develop a hybrid approach for enhancing multilingual machine translation without relying on parallel corpora.

February 14, 2026 | Open Access

A Hybrid Word and Sentence Alignment Approach for Unsupervised Multilingual Machine Translation Using Pre-Trained Cross-Lingual Encoder

Key Points

Utilized pre-trained cross-lingual encoders (XLM-R) for translation tasks
Constructed pseudo-parallel corpora combining word-by-word and contextual translations
Employed VecMap and FastText for word-level alignment
Implemented Adversarial Contrastive Learning with Hard Negative Mining for sentence alignment
Achieved average gains of +1.2 BLEU in bilingual and +0.9 BLEU in multilingual settings over state-of-the-art models
Outperformed existing models in translating low-resource Indian languages by an average of +0.7 BLEU
Demonstrated effectiveness in zero-shot and few-shot translation scenarios
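
The word-by-word step of pseudo-parallel corpus construction can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy English-to-Spanish dictionary, and the choice to copy unknown words unchanged are all assumptions made for illustration, and the contextual refinement with masked language modeling is not shown.

```python
from typing import Dict, List, Tuple


def word_by_word_translate(sentence: str, dictionary: Dict[str, str]) -> str:
    """Replace each token with its dictionary entry.

    Tokens missing from the dictionary are copied unchanged (an assumption;
    the summary does not say how the paper handles them at this stage).
    """
    tokens = sentence.lower().split()
    return " ".join(dictionary.get(tok, tok) for tok in tokens)


def build_pseudo_parallel(
    monolingual: List[str], dictionary: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Pair every monolingual sentence with its rough word-by-word translation."""
    return [(src, word_by_word_translate(src, dictionary)) for src in monolingual]


# Toy dictionary, purely illustrative (hypothetical data).
toy_dict = {"the": "el", "cat": "gato", "sleeps": "duerme"}
pairs = build_pseudo_parallel(["The cat sleeps", "The dog sleeps"], toy_dict)

for source, rough_target in pairs:
    print(source, "->", rough_target)
```

In the second pair the word "dog" is not in the toy dictionary and passes through untranslated, which is the kind of gap that subword-aware embeddings and contextual refinement are meant to address.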

Abstract

The lack of parallel corpora remains a challenge for multilingual neural machine translation (MNMT), particularly for low-resource languages. This paper presents an unsupervised framework that uses pre-trained cross-lingual encoders (XLM-R) to generate high-quality translations from monolingual corpora and bilingual dictionaries. The proposed method constructs pseudo-parallel corpora by combining word-by-word translation using bilingual dictionaries with contextual refinement via masked language modeling (MLM). To improve alignment quality, we propose a two-tier representation strategy: (1) word-level alignment that combines VecMap with FastText embeddings to address out-of-vocabulary (OOV) terms and capture morphological variations, and (2) sentence-level alignment using Adversarial Contrastive Learning (ACL) enhanced with Hard Negative Mining (HNM) to build semantically robust and discriminative sentence embeddings. Experimental results on the FLORES-101 dataset show that the proposed model outperforms existing state-of-the-art models, with average gains of +1.2 BLEU in bilingual settings and +0.9 BLEU in multilingual settings. Furthermore, the proposed model is evaluated on four low-resource Indian languages (Hindi, Urdu, Telugu, and Bengali), where it outperforms the state-of-the-art models by an average of +0.7 BLEU in both bilingual and multilingual settings. Finally, evaluation in zero-shot and few-shot settings confirms the proposed approach’s robustness and generalization, demonstrating an effective solution for multilingual translation without using parallel corpora.
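
To make the sentence-level idea concrete, the sketch below shows a generic contrastive (InfoNCE-style) loss combined with a simple form of hard negative mining. It is an illustration under stated assumptions, not the paper's method: the abstract does not give the loss formula, the mining rule, or the temperature, and the adversarial component is omitted entirely. Here "hard" negatives are assumed to be the candidates with the highest cosine similarity to the anchor, and all names and toy vectors are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mine_hard_negatives(anchor: Vector, candidates: List[Vector], k: int) -> List[Vector]:
    """Return the k candidates most similar to the anchor (assumed mining rule)."""
    return sorted(candidates, key=lambda c: cosine(anchor, c), reverse=True)[:k]


def contrastive_loss(
    anchor: Vector, positive: Vector, negatives: List[Vector], temperature: float = 0.1
) -> float:
    """InfoNCE-style loss: low when the positive is closer than every negative."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


# Toy 2-D "sentence embeddings" (hypothetical values).
anchor = [1.0, 0.0]            # source sentence
positive = [0.9, 0.1]          # its translation
pool = [[0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]]  # unrelated sentences

hard = mine_hard_negatives(anchor, pool, k=1)
loss_hard = contrastive_loss(anchor, positive, hard)
loss_easy = contrastive_loss(anchor, positive, [[-1.0, 0.0]])
print(hard, loss_hard > loss_easy)
```

The mined negative produces a larger loss than an easy one, so training on it gives a stronger signal for separating near-miss sentences from true translations.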

Cite This Study

Ali et al. studied this question.

synapsesocial.com/papers/699011932ccff479cfe58547 https://doi.org/10.1145/3796235
