BART-Based Bidirectional Mon-English Translation Using Custom Vocabulary Mapping and SentencePiece BPE
Abstract
Machine translation between Mon and English, in both directions, is a challenging but valuable area of research because it enables communication and access to information for Mon speakers. Mon is among the oldest languages in the Austroasiatic language family and is morphologically rich, with complex syllable and tone structures, which makes translation harder still. Continued research and development are needed to raise machine translation quality for this language pair. This paper describes tokenization approaches for bidirectional Mon-English neural machine translation (NMT) using BART and investigates two tokenization settings. In the first setting, Mon is tokenized with a custom syllable-level vocabulary mapping, while English uses the BART tokenizer. In the second setting, Mon is tokenized with SentencePiece BPE, while English again uses the BART tokenizer. Experiments evaluate both settings in both translation directions. With the custom vocabulary mapping for Mon, the model achieves a BLEU score of 30.14 for Mon-to-English and 25.43 for English-to-Mon translation. With SentencePiece BPE for Mon, performance improves to BLEU scores of 41.03 for Mon-to-English and 26.97 for English-to-Mon translation. These findings demonstrate the impact of tokenization choice on translation quality: SentencePiece BPE outperforms the syllable-level custom vocabulary mapping for Mon-English translation.
Keywords - Machine Translation, BART, Mon Language, SentencePiece BPE