Archive and Documentation Center
Digital Archive

Unsupervised learning of word alignments for statistical machine translation


dc.contributor Ph.D. Program in Electrical and Electronic Engineering.
dc.contributor.advisor Saraçlar, Murat.
dc.contributor.advisor Sarıkaya, Ruhi.
dc.contributor.author Mermer, Coşkun.
dc.date.accessioned 2023-03-16T10:25:21Z
dc.date.available 2023-03-16T10:25:21Z
dc.date.issued 2019.
dc.identifier.other EE 2019 M47 PhD
dc.identifier.uri http://digitalarchive.boun.edu.tr/handle/123456789/13147
dc.description.abstract Word alignment is a crucial first step in learning statistical translation models. In this dissertation, we propose a Bayesian approach to unsupervised learning of word alignments by introducing a sparse prior on the parameters of IBM word alignment models. In the original approach, word translation probabilities are estimated using the expectation-maximization (EM) algorithm. In the proposed approach, they are random variables with a prior and are integrated out during inference, where collapsed Gibbs sampling is used. The inferred word alignments are evaluated in a statistical machine translation (SMT) setting, experimenting with several language pairs and sizes of corpora and comparing against the EM and variational Bayes (VB) methods. We show that Bayesian inference outperforms both EM and VB in the majority of test cases, effectively addresses the high-fertility rare word problem in EM and unaligned rare word problem in VB, achieves higher agreement and vocabulary coverage rates than both, and leads to smaller phrase tables. We also propose a method for unsupervised learning of the optimal segmentation for SMT. We augment the original Morfessor monolingual segmentation model with a word alignment model so that the new model optimizes the posterior probability of the parallel training corpus according to a generative segmentation-translation model. In order to speed up computation, we propose an incremental method for approximate translation likelihood calculation and a parallelizable search algorithm, which improves the performance of even the monolingual segmentation. We use the proposed method to segment the Turkish side in a Turkish-to-English SMT system and find that the bilingual model results in more intuitive segmentations but does not yield a further significant increase in BLEU scores.
dc.format.extent 30 cm.
dc.publisher Thesis (Ph.D.)-Bogazici University. Institute for Graduate Studies in Science and Engineering, 2019.
dc.subject.lcsh Machine translating.
dc.title Unsupervised learning of word alignments for statistical machine translation
dc.format.pages xv, 95 leaves ;

