Morphologically motivated input variations in Turkish - English neural machine translation

Yirmibeşoğlu, Zeynep.

URI: http://digitalarchive.boun.edu.tr/handle/123456789/12459

Date: 2021.

Abstract:

The success of neural networks in natural language processing has paved the way for neural machine translation (NMT), which rapidly became the mainstream approach in machine translation. Breakthroughs such as encoder-decoder networks, the attention mechanism, and the Transformer architecture have brought tremendous improvements in translation performance. However, the need for large amounts of parallel data to train an NMT system and the presence of rare words in translation corpora remain open problems. This study addresses neural machine translation for the low-resource Turkish-English language pair. State-of-the-art NMT architectures are employed together with data augmentation methods that exploit monolingual corpora. The importance of input representation for the morphologically rich Turkish language is highlighted, and a comprehensive analysis of linguistically and non-linguistically motivated input segmentation approaches is carried out. Experiments on different input variations show the importance of morphologically motivated input segmentation for Turkish. Moreover, the Transformer architecture is shown to outperform attentional encoder-decoder models for the Turkish-English language pair. Among the data augmentation approaches employed, back-translation proves the most effective, and the benefit of increasing the amount of parallel data for translation quality is confirmed. This thesis presents a comprehensive analysis of NMT architectures with different hyperparameters, data augmentation methods, and input representation techniques, and proposes ways of tackling the low-resource setting of Turkish-English NMT.
