dc.description.abstract |
Word alignment is a crucial first step in learning statistical translation models. In this dissertation, we propose a Bayesian approach to unsupervised learning of word alignments by introducing a sparse prior on the parameters of IBM word alignment models. In the original approach, word translation probabilities are estimated using the expectation-maximization (EM) algorithm. In the proposed approach, they are random variables with a prior and are integrated out during inference, where collapsed Gibbs sampling is used. The inferred word alignments are evaluated in a statistical machine translation (SMT) setting, experimenting with several language pairs and corpus sizes and comparing against the EM and variational Bayes (VB) methods. We show that Bayesian inference outperforms both EM and VB in the majority of test cases, effectively addresses the high-fertility rare word problem in EM and the unaligned rare word problem in VB, achieves higher agreement and vocabulary coverage rates than both, and leads to smaller phrase tables. We also propose a method for unsupervised learning of the optimal segmentation for SMT. We augment the original Morfessor monolingual segmentation model with a word alignment model, so that the new model optimizes the posterior probability of the parallel training corpus according to a generative segmentation-translation model. To speed up computation, we propose an incremental method for approximate translation likelihood calculation and a parallelizable search algorithm, which improves the performance of even the monolingual segmentation. We use the proposed method to segment the Turkish side in a Turkish-to-English SMT system and find that the bilingual model results in more intuitive segmentations but does not yield a further significant increase in BLEU scores. |
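The collapsed Gibbs sampling approach summarized above can be illustrated with a minimal sketch for IBM Model 1 under a symmetric Dirichlet prior on the translation distributions, where the translation probabilities are integrated out and each alignment link is resampled from its collapsed conditional. This is an assumption-laden illustration, not the dissertation's implementation: the function name `gibbs_align`, the hyperparameter value, and the data layout are all hypothetical.

```python
import random
from collections import defaultdict

def gibbs_align(bitext, iters=50, alpha=0.01, seed=0):
    """Collapsed Gibbs sampler for IBM Model 1 alignments with a sparse
    symmetric Dirichlet(alpha) prior on translation probabilities t(f|e).

    bitext: list of (source_words, target_words) sentence pairs.
    Returns, for each pair, a list a with a[j] = index of the source word
    aligned to target word j.
    """
    rng = random.Random(seed)
    V = len({f for _, fs in bitext for f in fs})  # target vocabulary size
    # Initialise alignments uniformly at random.
    align = [[rng.randrange(len(es)) for _ in fs] for es, fs in bitext]
    pair = defaultdict(int)   # count(e, f) over current alignments
    total = defaultdict(int)  # count(e)
    for (es, fs), a in zip(bitext, align):
        for j, f in enumerate(fs):
            pair[(es[a[j]], f)] += 1
            total[es[a[j]]] += 1
    for _ in range(iters):
        for (es, fs), a in zip(bitext, align):
            for j, f in enumerate(fs):
                # Remove the current link from the counts.
                e_old = es[a[j]]
                pair[(e_old, f)] -= 1
                total[e_old] -= 1
                # Collapsed conditional: t(f|e) integrated out, so the
                # weight is (count(e,f) + alpha) / (count(e) + alpha * V).
                weights = [(pair[(e, f)] + alpha) / (total[e] + alpha * V)
                           for e in es]
                r = rng.random() * sum(weights)
                i = 0
                for i, w in enumerate(weights):
                    r -= w
                    if r <= 0:
                        break
                a[j] = i
                pair[(es[i], f)] += 1
                total[es[i]] += 1
    return align
```

Because the translation parameters are collapsed out, the sampler only tracks co-occurrence counts; the sparse prior (small `alpha`) concentrates each source word's translation distribution on few target types, which is the mechanism the abstract credits with curbing the high-fertility rare word problem.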
|