Welcome to the repository, designed following FAIR principles, for the experiments described in "Selecting Parallel In-domain Sentences for Neural Machine Translation Using Monolingual Texts".
You can read the paper on arXiv, ResearchGate, or the publisher's website.
Continuously growing data volumes lead to ever-larger generic models, while specific use cases are usually left out, since generic models tend to perform poorly on domain-specific data. Our work addresses this gap with a method for selecting in-domain data from generic-domain (parallel text) corpora for the task of machine translation. The proposed method ranks the sentences of a parallel general-domain corpus according to their cosine similarity with a monolingual domain-specific data set. We then select the top K sentences with the highest similarity scores to train a new machine translation system tuned to the specific in-domain data. Our experimental results show that models trained on this in-domain data outperform models trained on generic data or on a mixture of generic and domain data. That is, our method selects high-quality domain-specific training instances at low computational cost and with a small data size.
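To make the selection step concrete, here is a minimal sketch of the ranking idea using the sentence-transformers package. The model name, the toy data, the averaging of in-domain embeddings into a single query vector, and the value of K are illustrative assumptions, not the paper's exact configuration:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

# Source side of a generic-domain parallel corpus and a monolingual
# in-domain sample (toy data for illustration).
generic_src = [
    "The cat sat on the mat.",
    "The patient was given 5 mg of morphine.",
    "Parliament adopted the resolution yesterday.",
]
in_domain = [
    "The drug was administered intravenously.",
    "Dosage should be adjusted for renal impairment.",
]

# Embed both sides; averaging the in-domain embeddings into one query
# vector is a simplifying assumption of this sketch.
generic_emb = model.encode(generic_src, convert_to_tensor=True)
domain_emb = model.encode(in_domain, convert_to_tensor=True).mean(dim=0)

# Rank generic source sentences by cosine similarity and keep the top K;
# the matching target-side sentences would be selected by the same indices.
scores = util.cos_sim(domain_emb, generic_emb)[0]
k = 2
top = scores.topk(k)
for score, idx in zip(top.values, top.indices):
    print(f"{score:.3f}\t{generic_src[int(idx)]}")
```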
We developed a Python tool that automates the process of domain-specific sentence selection using semantic similarity. It is especially useful when:
- You only have monolingual in-domain data
- You want to fine-tune models efficiently without massive retraining
- You need a lightweight, scalable solution
🔗 Visit the tool’s standalone repository →
The tool supports customizable selection size, sentence embedding models (e.g., SBERT), and single-GPU usage.
Check its README for installation, usage examples, and integration instructions.
| System | Link | System | Link | 
|---|---|---|---|
| Top1 | Download | Top1 | Download | 
| Top2+Top1 | Download | Top2 | Download | 
| Top3+Top2+... | Download | Top3 | Download | 
| Top4+Top3+... | Download | Top4 | Download | 
| Top5+Top4+... | Download | Top5 | Download | 
| Top6+Top5+... | Download | Top6 | Download | 
Note: Git LFS bandwidth for personal accounts is limited to 1 GB/month. If you're unable to download the models, follow this link.
Note: we ported the best checkpoints of the trained models to the Hugging Face Hub (HF). Since our models were trained with OpenNMT-py, they cannot be used directly for inference on HF. To bypass this issue, we use CTranslate2, an inference engine for Transformer models.
Follow the steps below to translate your sentences:
1. Install the Python package:

```bash
pip install --upgrade pip
pip install ctranslate2
```

2. Download models from our HF repository. You can do this manually or use the following Python script:
```python
import requests

# Placeholders: replace with the model's direct download link and a local path.
url = "Download Link"
model_path = "Model Path"

r = requests.get(url, allow_redirects=True)
with open(model_path, "wb") as f:
    f.write(r.content)
```

3. Convert the downloaded model:
```bash
ct2-opennmt-py-converter --model_path model_path --output_dir output_directory
```

4. Translate tokenized inputs:
Note: the inputs should be tokenized with SentencePiece. You can also use the tokenized versions of the IWSLT test sets.
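For illustration, a minimal tokenization sketch using the sentencepiece Python package; "sp.model" is a placeholder for the SentencePiece model that matches the translation model, not a file shipped in this repository:

```python
import sentencepiece as spm

# "sp.model" is a placeholder path to the matching SentencePiece model.
sp = spm.SentencePieceProcessor(model_file="sp.model")
pieces = sp.encode("Hello world!", out_type=str)
print(pieces)  # e.g. ['▁Hello', '▁world', '!']
```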
```python
import ctranslate2

translator = ctranslate2.Translator("output_directory/")
translator.translate_batch([["▁H", "ello", "▁world", "!"]])
```

or
```python
import ctranslate2

translator = ctranslate2.Translator("output_directory/")
# input_file/output_file are placeholder paths to the tokenized source and
# the translation output; batch_type can be "tokens" or "examples".
translator.translate_file(input_file, output_file, batch_type="tokens")
```

To customize the CTranslate2 functions, read this API document.
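For reference, `translate_batch` returns a list of `TranslationResult` objects; the top hypothesis for each input is a list of target tokens:

```python
results = translator.translate_batch([["▁H", "ello", "▁world", "!"]])
print(results[0].hypotheses[0])  # tokenized translation of the first input
```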
5. Detokenize the outputs:
Note: you need to detokenize the output with the same SentencePiece model used in step 4.
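If your outputs are SentencePiece pieces, the same model can also detokenize them directly; a minimal sketch ("sp.model" is again a placeholder):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="sp.model")
print(sp.decode(["▁H", "ello", "▁world", "!"]))  # -> "Hello world!"
```

Alternatively, use the Moses detokenizer script: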
```bash
tools/detokenize.perl -no-escape -l fr \
    < output_file \
    > output_file.detok
```

6. Remove the @@ tokens:
```bash
cat output_file.detok | sed -E 's/(@@ )|(@@)|(@@ ?$)//g' \
    > output_file.detok.postprocessed
```

Use grep to check whether the @@ tokens were removed successfully:

```bash
grep @@ output_file.detok.postprocessed
```

- Javad Pourmostafa - Email, Website
- Dimitar Shterionov - Email, Website
- Pieter Spronck - Email, Website
If you find this repository helpful, feel free to cite our publication:
```bibtex
@article{pourmostafa_shterionov_spronck_2021,
  title   = {Selecting Parallel In-domain Sentences for Neural Machine Translation Using Monolingual Texts},
  author  = {Pourmostafa Roshan Sharami, Javad and Shterionov, Dimitar and Spronck, Pieter},
  journal = {Computational Linguistics in the Netherlands Journal},
  volume  = {11},
  pages   = {213--230},
  month   = {Dec.},
  year    = {2021},
  url     = {https://www.clinjournal.org/clinj/article/view/137}
}
```