Utilizing out-of-domain data through language modeling based vocabulary saturation for English-Turkish machine translation

Aydın, Burak.

dc.contributor	Graduate Program in Computer Engineering.
dc.contributor.advisor	Özgür, Arzucan.
dc.contributor.author	Aydın, Burak.
dc.date.accessioned	2023-03-16T10:01:51Z
dc.date.available	2023-03-16T10:01:51Z
dc.date.issued	2014.
dc.identifier.other	CMPE 2014 A83
dc.identifier.uri	http://digitalarchive.boun.edu.tr/handle/123456789/12273
dc.description.abstract	The training data size is of utmost importance for statistical machine translation (SMT), since it affects the training time, model size, and decoding speed, as well as the system's overall success. One of the challenges in developing SMT systems for languages with fewer resources is the limited size of the available training data. In this thesis, we propose an approach for expanding the training data by including parallel texts from an out-of-domain corpus. Selecting the best out-of-domain sentences for inclusion in the training set is important for the overall performance of the system. Our method is based on first ranking the out-of-domain sentences using a language modeling approach, and then adding the sentences to the training set by using the vocabulary saturation filter technique. We evaluated our approach for the English-Turkish language pair and obtained promising results, achieving performance improvements of up to +0.8 BLEU points for English-Turkish translation. We also compared our results with translation model combination approaches and with the best English-Turkish translation systems, and reported the improvements. Moreover, we implemented our system with dependency-based language modeling in addition to n-gram-based language modeling and reported comparable results.
dc.format.extent	30 cm.
dc.publisher	Thesis (M.S.) - Bogazici University. Institute for Graduate Studies in Science and Engineering, 2014.
dc.subject.lcsh	Machine translating.
dc.subject.lcsh	Turkish language -- Machine translating.
dc.subject.lcsh	English language -- Machine translating.
dc.title	Utilizing out-of-domain data through language modeling based vocabulary saturation for English-Turkish machine translation
dc.format.pages	xi, 40 leaves ;
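
The two-step selection described in the abstract (rank out-of-domain sentences with a language model, then admit them through a vocabulary saturation filter) can be illustrated with a minimal sketch. This is not the thesis implementation: the add-one smoothed unigram model stands in for the n-gram and dependency-based models the abstract mentions, the filter counts single words rather than longer n-grams, and all function names, the `threshold` parameter, and the toy sentences are assumptions made for illustration.

```python
import math
from collections import Counter


def train_unigram_lm(sentences):
    """Return a log-probability function from an add-one smoothed unigram model.

    Assumption: a unigram model is used only to keep the sketch short.
    """
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values())
    vocab_size = len(counts) + 1  # one extra slot for unseen words

    def logprob(word):
        return math.log((counts[word] + 1) / (total + vocab_size))

    return logprob


def perplexity(sentence, logprob):
    """Per-word perplexity of a sentence under the given model."""
    words = sentence.split()
    if not words:
        return float("inf")
    return math.exp(-sum(logprob(w) for w in words) / len(words))


def rank_by_lm(pairs, logprob):
    """Sort (source, target) pairs so the most in-domain-like sources come first."""
    return sorted(pairs, key=lambda pair: perplexity(pair[0], logprob))


def vocabulary_saturation_filter(ranked_pairs, threshold=1, seed_sentences=()):
    """Keep a pair only if its source has a word seen fewer than `threshold` times.

    `seed_sentences` (hypothetical parameter) pre-fills the counts, e.g. with
    the in-domain training data that is already part of the training set.
    """
    seen = Counter(w for s in seed_sentences for w in s.split())
    selected = []
    for src, tgt in ranked_pairs:
        words = src.split()
        if any(seen[w] < threshold for w in words):
            selected.append((src, tgt))
            seen.update(words)
    return selected


if __name__ == "__main__":
    in_domain = ["the cat sat", "the dog sat"]
    out_of_domain = [
        ("a bird flew", "bir kuş uçtu"),
        ("the cat sat", "kedi oturdu"),
        ("the cat sat", "kedi oturdu"),
    ]
    lm = train_unigram_lm(in_domain)
    ranked = rank_by_lm(out_of_domain, lm)
    for pair in vocabulary_saturation_filter(ranked, threshold=1):
        print(pair)
```

In this sketch, a higher `threshold` admits more sentences before the vocabulary is considered saturated, and the ranking step determines which sentences are seen first and therefore get the chance to contribute new words.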