Abstract:
In this thesis, we compare four statistical language models for large-vocabulary Turkish speech recognition. Turkish is an agglutinative language with highly productive morphotactics. This property leads to vocabulary explosion and poor estimation of N-gram probabilities when designing speech recognition systems. The solution is to decompose words into smaller base units that can cover the language with a relatively small vocabulary. Three ways of decomposing words into base units are described: a morpheme-based model, a stem-ending-based model, and a syllable-based model. These models, together with the word-based model, are compared with respect to vocabulary size, text coverage, bigram perplexity, and speech recognition performance. We have constructed a Turkish text corpus of 10 million words from various texts collected from the Web. These texts have been parsed into their morphemes, stems, endings, and syllables, and the statistics of these base units have been estimated. Finally, we have performed speech recognition experiments with language models built on these base units.