Welcome to the repository, designed following FAIR principles, for the experiments described in "Selecting Parallel In-domain Sentences for Neural Machine Translation Using Monolingual Texts".
You can read the paper on arXiv, ResearchGate, or the publisher's website.
Continuously growing data volumes lead to ever-larger generic models, while specific use cases are usually left out, since generic models tend to perform poorly on domain-specific data. Our work addresses this gap with a method for selecting in-domain data from generic-domain (parallel text) corpora for the task of machine translation. The proposed method ranks the sentences of a parallel general-domain corpus according to their cosine similarity with a monolingual domain-specific data set. We then select the top K sentences with the highest similarity scores to train a new machine translation system tuned to the specific in-domain data. Our experimental results show that models trained on this in-domain data outperform models trained on generic data or on a mixture of generic and domain data. That is, our method selects high-quality domain-specific training instances at low computational cost and with a small data size.
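To make the selection step concrete, here is a minimal sketch of the ranking idea using the sentence-transformers package. The model name, the toy data, the averaging of in-domain embeddings into a single query vector, and the value of K are illustrative assumptions, not the paper's exact configuration:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

# Source side of a generic-domain parallel corpus and a monolingual
# in-domain sample (toy data for illustration).
generic_src = [
    "The cat sat on the mat.",
    "The patient was given 5 mg of morphine.",
    "Parliament adopted the resolution yesterday.",
]
in_domain = [
    "The drug was administered intravenously.",
    "Dosage should be adjusted for renal impairment.",
]

# Embed both sides; averaging the in-domain embeddings into one query
# vector is a simplifying assumption of this sketch.
generic_emb = model.encode(generic_src, convert_to_tensor=True)
domain_emb = model.encode(in_domain, convert_to_tensor=True).mean(dim=0)

# Rank generic source sentences by cosine similarity and keep the top K;
# the matching target-side sentences would be selected by the same indices.
scores = util.cos_sim(domain_emb, generic_emb)[0]
k = 2
top = scores.topk(k)
for score, idx in zip(top.values, top.indices):
    print(f"{score:.3f}\t{generic_src[int(idx)]}")
```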
We developed a Python tool that automates the process of domain-specific sentence selection using semantic similarity. It is especially useful when:
- You only have monolingual in-domain data
- You want to fine-tune models efficiently without massive retraining
- You need a lightweight, scalable solution
🔗 Visit the tool’s standalone repository →
The tool supports customizable selection size, sentence embedding models (e.g., SBERT), and single-GPU usage.
Check its README for installation, usage examples, and integration instructions.
| System | Link | System | Link | 
|---|---|---|---|
| Top1 | Download | Top1 | Download | 
| Top2+Top1 | Download | Top2 | Download | 
| Top3+Top2+... | Download | Top3 | Download | 
| Top4+Top3+... | Download | Top4 | Download | 
| Top5+Top4+... | Download | Top5 | Download | 
| Top6+Top5+... | Download | Top6 | Download | 
Note: Git LFS bandwidth for personal accounts is limited to 1 GB/month. If you're unable to download the models, follow this link.
Note: we ported the best checkpoints of the trained models to the Hugging Face Hub (HF). Since our models were trained with OpenNMT-py, they cannot be used directly for inference on HF. To bypass this issue, we use CTranslate2, an inference engine for Transformer models.
Follow the steps below to translate your sentences:
1. Install the Python package:

```bash
pip install --upgrade pip
pip install ctranslate2
```

2. Download models from our HF repository. You can do this manually or use the following Python script:
```python
import requests

# Placeholders: replace with the model's direct download link and a local path.
url = "Download Link"
model_path = "Model Path"

r = requests.get(url, allow_redirects=True)
with open(model_path, "wb") as f:
    f.write(r.content)
```

3. Convert the downloaded model:
```bash
ct2-opennmt-py-converter --model_path model_path --output_dir output_directory
```

4. Translate tokenized inputs:
Note: the inputs should be tokenized with SentencePiece. You can also use the tokenized versions of the IWSLT test sets.
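For illustration, a minimal tokenization sketch using the sentencepiece Python package; "sp.model" is a placeholder for the SentencePiece model that matches the translation model, not a file shipped in this repository:

```python
import sentencepiece as spm

# "sp.model" is a placeholder path to the matching SentencePiece model.
sp = spm.SentencePieceProcessor(model_file="sp.model")
pieces = sp.encode("Hello world!", out_type=str)
print(pieces)  # e.g. ['▁Hello', '▁world', '!']
```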
```python
import ctranslate2

translator = ctranslate2.Translator("output_directory/")
translator.translate_batch([["▁H", "ello", "▁world", "!"]])
```

or
```python
import ctranslate2

translator = ctranslate2.Translator("output_directory/")
# input_file/output_file are placeholder paths to the tokenized source and
# the translation output; batch_type can be "tokens" or "examples".
translator.translate_file(input_file, output_file, batch_type="tokens")
```

To customize the CTranslate2 functions, read this API document.
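For reference, `translate_batch` returns a list of `TranslationResult` objects; the top hypothesis for each input is a list of target tokens:

```python
results = translator.translate_batch([["▁H", "ello", "▁world", "!"]])
print(results[0].hypotheses[0])  # tokenized translation of the first input
```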
5. Detokenize the outputs:
Note: you need to detokenize the output with the same SentencePiece model used in step 4.
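If your outputs are SentencePiece pieces, the same model can also detokenize them directly; a minimal sketch ("sp.model" is again a placeholder):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="sp.model")
print(sp.decode(["▁H", "ello", "▁world", "!"]))  # -> "Hello world!"
```

Alternatively, use the Moses detokenizer script: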
```bash
tools/detokenize.perl -no-escape -l fr \
    < output_file \
    > output_file.detok
```

6. Remove the @@ tokens:
```bash
cat output_file.detok | sed -E 's/(@@ )|(@@)|(@@ ?$)//g' \
    > output_file.detok.postprocessed
```

Use grep to check whether the @@ tokens were removed successfully:

```bash
grep @@ output_file.detok.postprocessed
```

- Javad Pourmostafa - Email, Website
- Dimitar Shterionov - Email, Website
- Pieter Spronck - Email, Website
If you find this repository helpful, feel free to cite our publication:
```bibtex
@article{pourmostafa_shterionov_spronck_2021,
  title   = {Selecting Parallel In-domain Sentences for Neural Machine Translation Using Monolingual Texts},
  author  = {Pourmostafa Roshan Sharami, Javad and Shterionov, Dimitar and Spronck, Pieter},
  journal = {Computational Linguistics in the Netherlands Journal},
  volume  = {11},
  pages   = {213--230},
  month   = {Dec.},
  year    = {2021},
  url     = {https://www.clinjournal.org/clinj/article/view/137}
}
```