The lack of parallel corpora remains a major challenge for multilingual neural machine translation (MNMT), particularly for low-resource languages. This paper presents an unsupervised framework that leverages pre-trained cross-lingual encoders (XLM-R) to generate high-quality translations using only monolingual corpora and bilingual dictionaries. The proposed method constructs pseudo-parallel corpora by combining dictionary-based word-by-word translation with contextual refinement via masked language modeling (MLM). To improve alignment quality, we propose a two-tier representation strategy: (1) word-level alignment that combines VecMap with FastText embeddings to address out-of-vocabulary (OOV) terms and capture morphological variations, and (2) sentence-level alignment using Adversarial Contrastive Learning (ACL) enhanced with Hard Negative Mining (HNM) to build semantically robust and discriminative sentence embeddings. Experimental results on the FLORES-101 dataset show that the proposed model outperforms existing state-of-the-art models by an average of +1.2 BLEU in bilingual settings and +0.9 BLEU in multilingual settings. Furthermore, the proposed model is evaluated on four low-resource Indian languages (Hindi, Urdu, Telugu, and Bengali), where it outperforms state-of-the-art models by an average of +0.7 BLEU in both bilingual and multilingual settings. Finally, evaluation in zero-shot and few-shot settings confirms the proposed approach’s robustness and generalization, demonstrating an effective solution for multilingual translation without parallel corpora.
Ali et al. (Thu,) studied this question.
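To make two of the steps named in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It illustrates (a) dictionary-based word-by-word translation in which out-of-dictionary tokens are replaced by a mask placeholder, so that an MLM could later fill them in from context, and (b) hard negative selection combined with a contrastive loss. All function names, the toy dictionary, and the toy vectors are hypothetical. The loss is a standard InfoNCE-style objective chosen here as an assumption; the abstract does not specify the exact loss, and the adversarial component of ACL is not modeled.

```python
import math

MASK = "<mask>"  # placeholder for tokens an MLM would later fill in (assumed token string)


def word_by_word(sentence, dictionary):
    """Translate token by token; emit MASK where the dictionary has no entry."""
    return [dictionary.get(tok.lower(), MASK) for tok in sentence.split()]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def hardest_negative(anchor, positive_idx, candidates):
    """Index of the non-positive candidate most similar to the anchor."""
    scored = [
        (cosine(anchor, cand), i)
        for i, cand in enumerate(candidates)
        if i != positive_idx
    ]
    return max(scored)[1]


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: -log softmax of the positive among all candidates."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    top = max(logits)  # subtract the max for numerical stability
    log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_denominator - logits[0]


if __name__ == "__main__":
    toy_dict = {"the": "le", "cat": "chat"}  # hypothetical bilingual dictionary
    print(word_by_word("The cat sleeps", toy_dict))

    anchor = [1.0, 0.0]
    pool = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # index 0 is the positive
    hard = hardest_negative(anchor, 0, pool)
    print(hard, info_nce(anchor, pool[0], [pool[hard]]))
```

In this toy setting, a negative that lies close to the anchor yields a larger loss than a distant one, which is the intuition behind mining hard negatives: they provide a stronger training signal for discriminative sentence embeddings.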