MITRE 913M
Description
MITRE (Multilingual Translation with Registers) is a multilingual, decoder-only model designed for many-to-many translation tasks.
The underlying technique, registering, is introduced in our paper.
This repository lets you run inference with our pre-trained model. If you want to reproduce the data mining and training, please refer to this repository.
The model supports direct translation in 552 directions (24 × 23 ordered language pairs) among 24 languages spanning 5 language families.
You can use our models directly through the transformers library.
An alternative version of MITRE with 466M parameters is also available in this repository.
Usage
Before loading the tokenizer, install sentencepiece by running pip install sentencepiece.
You can then load the tokenizer and the model with:
from transformers import AutoModel, AutoTokenizer
# you can switch the name to "naist-nlp/mitre_466m"
tokenizer = AutoTokenizer.from_pretrained("naist-nlp/mitre_913m", trust_remote_code=True, use_fast=False)
model = AutoModel.from_pretrained("naist-nlp/mitre_913m", trust_remote_code=True)
To use the model locally and inspect the code, clone this hub repository and then run:
from mitre_913m.tokenization_mitre import MitreTokenizer
from mitre_913m.modeling_mitre import MitreForConditionalGeneration
tokenizer = MitreTokenizer.from_pretrained("mitre_913m")
model = MitreForConditionalGeneration.from_pretrained("mitre_913m")
Once the tokenizer and the model are loaded, you can translate:
english_text = "I have a red apple."
chinese_text = "我有一个红苹果。"
model.half()  # recommended
model.cuda()  # keep the model on the same device as the inputs passed to generate() below
model.eval()
# Translate one or several sentences into a single target language
src_tokens = tokenizer.encode_source_tokens_to_input_ids([english_text, ], target_language="zh")
# Translate several sentences, each into its own target language
# src_tokens = tokenizer.encode_source_tokens_to_input_ids_with_different_tags([english_text, english_text, ], target_languages_list=["de", "zh", ])
generated_tokens = model.generate(src_tokens.cuda())
results = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(results)
# results (the de line appears only with the commented-out multi-target call above)
# de: Ich habe einen roten Apfel.
# zh: 我有一个红苹果。
# For training
# 1. The difference between tgt_tokens and labels is that the eos_tokens are moved to the right side.
# 2. We recommend using 'tokenizer.encode_target_tokens_to_labels' instead of modifying tgt_tokens,
# because 'tokenizer.encode_target_tokens_to_input_ids' has pads.
# 3. You can refer to our code for detailed implementation.
# tgt_tokens = tokenizer.encode_target_tokens_to_input_ids(chinese_text)
# labels = tokenizer.encode_target_tokens_to_labels(chinese_text)
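The inference steps above can be wrapped in a small helper so that the inputs always follow the model's device. This is a minimal sketch, not part of the released code: the function name translate and its device argument are hypothetical, and it uses only the tokenizer and model calls shown above. Whether generate() also runs on CPU (and without model.half()) is an assumption that has not been verified here.

```python
from typing import List


def translate(texts: List[str], target_language: str, tokenizer, model,
              device: str = "cpu") -> List[str]:
    """Translate a batch of sentences into one target language.

    `tokenizer` and `model` are the MITRE objects loaded as shown above;
    `device` must be the device the model already lives on.
    """
    # Encode the source sentences together with the target-language tag.
    src_tokens = tokenizer.encode_source_tokens_to_input_ids(
        texts, target_language=target_language
    )
    # Move the inputs to the model's device before generating.
    generated_tokens = model.generate(src_tokens.to(device))
    # Turn the generated ids back into plain strings.
    return tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)


# Usage (assumes `tokenizer` and `model` were loaded as shown above):
#   import torch
#   device = "cuda" if torch.cuda.is_available() else "cpu"
#   model.to(device).eval()
#   print(translate(["I have a red apple."], "zh", tokenizer, model, device))
```

Passing the device explicitly avoids the hard-coded .cuda() call, so the same helper can be tried on machines without a GPU.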
Notes
We largely follow the style of M2M, but we make some necessary improvements to reduce the cost of generation.
See the code of generate() in modeling_mitre.py for details.
We also plan to implement FlashAttention V2 to further speed up our models; this will be added as soon as possible.
Languages covered
Germanic: English (en), German (de), Dutch; Flemish (nl), Swedish (sv), Danish (da), Afrikaans (af)
Romance: French (fr), Spanish (es), Italian (it), Portuguese (pt), Romanian; Moldavian; Moldovan (ro)
Slavic: Russian (ru), Czech (cs), Polish (pl), Bulgarian (bg), Ukrainian (uk)
Malayo-Polynesian: Indonesian (id), Malay (ms), Javanese (jv), Tagalog; Filipino (tl)
Asian*: Chinese (zh), Japanese (ja), Korean (ko), Vietnamese (vi)
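As a quick check of the list above, the following sketch enumerates every ordered (source, target) pair from the language codes. The dictionary and variable names are illustrative only; the codes themselves are copied from the list. Such a list of pairs can be handy when looping over translation directions in an evaluation script.

```python
from itertools import permutations

# Language codes copied from the list above, grouped by family.
LANGUAGES = {
    "Germanic": ["en", "de", "nl", "sv", "da", "af"],
    "Romance": ["fr", "es", "it", "pt", "ro"],
    "Slavic": ["ru", "cs", "pl", "bg", "uk"],
    "Malayo-Polynesian": ["id", "ms", "jv", "tl"],
    "Asian": ["zh", "ja", "ko", "vi"],
}

# Flatten the families into a single list of codes.
codes = [code for family in LANGUAGES.values() for code in family]

# Every ordered pair of distinct languages is one translation direction.
directions = list(permutations(codes, 2))

print(len(codes), len(directions))  # 24 552
```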
BibTeX entry and citation info
@misc{qu2025registeringsourcetokenstarget,
title={Registering Source Tokens to Target Language Spaces in Multilingual Neural Machine Translation},
author={Zhi Qu and Yiran Wang and Jiannan Mao and Chenchen Ding and Hideki Tanaka and Masao Utiyama and Taro Watanabe},
year={2025},
eprint={2501.02979},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2501.02979},
}