Abstract:
Languages with agglutinative or in ectional morphology have proven to be challenging for speech and language processing due to relatively large vocabulary sizes leading to a high number of out-of-vocabulary (OOV) words. In this thesis, we tackle with these challenges in automatic speech recognition (ASR) for Turkish which has an extremely productive in ectional and derivational morphology. First, we build the necessary tools and resources for Turkish, namely a nite-state morphological parser, a perceptron-based morphological disambiguator, and a text corpus collected from the world wide web. Second, we introduce two complementary language modeling approaches to alleviate the OOV word problem and to exploit morphology as a knowledge source. The first, morpholexical language model, is a generative n-gram model, where modeling units are lexical-grammatical morphemes instead of commonly used words or statistical sub-words. The second is a linear reranking model trained discriminatively with a variant of the perceptron algorithm, word error rate (WER) sensitive perceptron, using morpholexical and morphosyntactic features to rerank n-best candidates obtained with the generative model. We apply the proposed models in Turkish broadcast news transcription task and give experimental results. We also propose a novel approach for integrating morphology into an ASR system in the nite-state transducer framework as a knowledge source. The morpholexical model is highly e ective in alleviating the OOV problem and improves the WER over word and statistical sub-word models by 1.8% and 0.8% absolute, respectively. The discriminatively trained model further improves the WER of the system by 0.8% absolute. Finally, we present an algorithm for on-the-fly lattice rescoring with low-latency.