Archives and Documentation Center
Digital Archives

Utilizing out-of-domain data through language modeling based vocabulary saturation for English-Turkish machine translation


dc.contributor Graduate Program in Computer Engineering.
dc.contributor.advisor Özgür, Arzucan.
dc.contributor.author Aydın, Burak.
dc.date.accessioned 2023-03-16T10:01:51Z
dc.date.available 2023-03-16T10:01:51Z
dc.date.issued 2014.
dc.identifier.other CMPE 2014 A83
dc.identifier.uri http://digitalarchive.boun.edu.tr/handle/123456789/12273
dc.description.abstract The size of the training data is of utmost importance for statistical machine translation (SMT), since it affects the training time, model size, and decoding speed, as well as the system's overall success. One of the challenges in developing SMT systems for less-resourced languages is the limited size of the available training data. In this thesis, we propose an approach for expanding the training data by including parallel texts from an out-of-domain corpus. Selecting the best out-of-domain sentences for inclusion in the training set is important for the overall performance of the system. Our method is based on first ranking the out-of-domain sentences using a language modeling approach and then adding sentences to the training set using the vocabulary saturation filter technique. We evaluated our approach for the English-Turkish language pair and obtained promising results: performance improvements of up to +0.8 BLEU points for English-Turkish translation were achieved. We also compared our results with translation model combination approaches and with the best English-Turkish translation systems, and reported the improvements. Moreover, we implemented our system with dependency-based language modeling in addition to n-gram-based language modeling and reported comparable results.
dc.format.extent 30 cm.
dc.publisher Thesis (M.S.) - Bogazici University. Institute for Graduate Studies in Science and Engineering, 2014.
dc.subject.lcsh Machine translating.
dc.subject.lcsh Turkish language -- Machine translating.
dc.subject.lcsh English language -- Machine translating.
dc.title Utilizing out-of-domain data through language modeling based vocabulary saturation for English-Turkish machine translation
dc.format.pages xi, 40 leaves ;
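
The two-step selection described in the abstract (rank out-of-domain sentences with a language model, then apply a vocabulary saturation filter) can be illustrated with a rough sketch. This is not the thesis implementation: the add-one smoothed unigram model, the use of source-side perplexity as the ranking score, the token-level (unigram) saturation check, and all function and parameter names below are assumptions made purely for illustration.

```python
import math
from collections import Counter


def unigram_lm(in_domain_sentences):
    """Return a perplexity function from an add-one smoothed unigram model.

    Assumption: a toy unigram model stands in for whatever language model
    is actually used for ranking.
    """
    counts = Counter(tok for s in in_domain_sentences for tok in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot reserves mass for unseen tokens

    def perplexity(sentence):
        toks = sentence.split()
        if not toks:
            return float("inf")
        log_prob = sum(math.log((counts[t] + 1) / (total + vocab)) for t in toks)
        return math.exp(-log_prob / len(toks))

    return perplexity


def rank_by_lm(pairs, perplexity):
    """Sort (source, target) pairs by source-side perplexity, lowest first."""
    return sorted(pairs, key=lambda pair: perplexity(pair[0]))


def vocabulary_saturation_filter(ranked_pairs, threshold=1):
    """Keep a pair only if some source token has been seen fewer than
    `threshold` times among the pairs kept so far (hypothetical simplification
    that checks single tokens rather than longer n-grams)."""
    seen = Counter()
    kept = []
    for src, tgt in ranked_pairs:
        toks = src.split()
        if any(seen[t] < threshold for t in toks):
            kept.append((src, tgt))
            seen.update(toks)  # counts grow only from kept sentences
    return kept
```

In this sketch, sentences that look most like the in-domain data are considered first, and a sentence is skipped once every token it contains has already reached the threshold, so near-duplicates contribute nothing further to the training set.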

