On April 30, 2025, researchers Maxime Bouthors, Josep Crego, and François Yvon published “Improving Retrieval-Augmented Neural Machine Translation with Monolingual Data,” introducing a method that enhances machine translation by exploiting monolingual data.
The study introduces improved cross-lingual retrieval systems for retrieval-augmented neural machine translation (RANMT), leveraging monolingual target-side corpora instead of relying solely on bilingual data. By training with both sentence-level and word-level matching objectives, the researchers demonstrate enhanced translation performance compared to traditional methods using translation memories. Experiments across two RANMT architectures show significant improvements in controlled settings and real-world scenarios where monolingual resources far exceed parallel data availability. The new approach outperforms baseline systems and general-purpose cross-lingual retrievers, highlighting its effectiveness in utilizing underexploited monolingual resources for machine translation tasks.
In recent years, machine translation has made significant strides, yet challenges remain in accurately translating less frequent or ambiguous terms. A new study explores how retrieval-augmented neural machine translation (RANMT) can address these limitations by integrating external resources to improve translation quality. The research compares two methods—fuzzy string matching and cross-lingual information retrieval (CLIR) using FAISS—evaluating their performance on English-French and German-English language pairs.
Neural machine translation (NMT) systems have advanced considerably, but they often struggle with less frequent or ambiguous terms. Retrieval-augmented neural machine translation addresses this by integrating external resources to enhance the model’s output. The study introduces TM3-LevT, a Transformer-based NMT model that incorporates retrieval techniques to leverage bilingual and monolingual data effectively.
The research compares two retrieval methods: fuzzy string matching and cross-lingual information retrieval (CLIR). Fuzzy string matching uses Levenshtein edit distance, with BM25 filtering to narrow the candidate set, to identify similar sentences in a translation memory. While computationally efficient, this method may miss semantically similar phrases that aren’t structurally identical.
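As an illustrative sketch (not the paper’s implementation), fuzzy matching over a toy translation memory can combine a cheap lexical pre-filter with exact edit distance; here a simple word-overlap check stands in for the BM25 filter, and the memory sentences are invented:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def fuzzy_match(query: str, memory: list[str], min_overlap: int = 1) -> str:
    """Return the TM sentence closest to `query` by edit distance.
    The word-overlap filter is a stand-in for BM25 candidate pruning."""
    q_words = set(query.lower().split())
    candidates = [s for s in memory
                  if len(q_words & set(s.lower().split())) >= min_overlap]
    return min(candidates or memory, key=lambda s: levenshtein(query, s))

tm = ["the cat sat on the mat",
      "a dog ran in the park",
      "the cat sat on the chair"]
print(fuzzy_match("the cat sat on the rug", tm))  # → "the cat sat on the mat"
```

The filter keeps retrieval cheap: edit distance is only computed over sentences sharing surface vocabulary with the query, which is exactly why purely lexical matching can miss paraphrases.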
In contrast, CLIR relies on dense vector representations from multilingual encoders, using FAISS for efficient nearest-neighbor search; this captures semantic similarity more effectively than fuzzy matching. However, the approach requires significant computational resources. Both methods were tested on datasets of varying sizes and with different levels of overlap in training data to assess their robustness.
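The dense-retrieval idea can be sketched with a NumPy stand-in for FAISS. In a real CLIR setup, a multilingual sentence encoder would embed the source-language query and the target-side corpus, and FAISS (e.g. an inner-product index over normalized vectors) would perform the search; the hand-crafted vectors and sentences below are purely illustrative:

```python
import numpy as np

# Toy target-side "corpus" with made-up 3-d embeddings; in practice these
# would be encoder outputs indexed by FAISS.
target_sentences = ["le chat dort", "le chien court", "il pleut beaucoup"]
target_vecs = np.array([[0.9, 0.1, 0.0],
                        [0.1, 0.9, 0.0],
                        [0.0, 0.1, 0.9]])

def retrieve(query_vec: np.ndarray, index: np.ndarray, k: int = 1) -> list[int]:
    """Cosine-similarity nearest neighbors: the brute-force equivalent of a
    FAISS inner-product search over L2-normalized vectors."""
    q = query_vec / np.linalg.norm(query_vec)
    idx = index / np.linalg.norm(index, axis=1, keepdims=True)
    sims = idx @ q
    return list(np.argsort(-sims)[:k])

# Hypothetical embedding of the English query "the cat is sleeping"
query = np.array([0.85, 0.15, 0.05])
best = retrieve(query, target_vecs)[0]
print(target_sentences[best])  # → "le chat dort"
```

Because the query and corpus live in a shared multilingual embedding space, the match crosses the language boundary without any string overlap, which is what lets CLIR retrieve semantically similar but lexically unrelated sentences.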
The findings reveal that CLIR outperforms fuzzy string matching, particularly when dealing with larger datasets. This is attributed to CLIR’s ability to capture semantic nuances beyond mere structural similarities. Interestingly, the overlap in training data did not significantly impact performance, indicating that these techniques can be effective even with some shared data.
While CLIR offers superior translation quality due to its semantic matching, it comes at a higher computational cost. Fuzzy string matching, on the other hand, provides a practical balance between speed and accuracy, especially when optimized with BM25 filtering. The study underscores the importance of choosing the retrieval technique that fits one’s needs and resources.
This research contributes valuable insights into enhancing machine translation systems, offering actionable recommendations for practitioners aiming to integrate retrieval techniques into their NMT workflows.
👉 More information
🗞 Improving Retrieval-Augmented Neural Machine Translation with Monolingual Data
🧠 DOI: https://doi.org/10.48550/arXiv.2504.21747
