Statistical language models for large vocabulary Turkish speech recognition

Dutağacı, Helin.

dc.contributor	Graduate Program in Electrical and Electronic Engineering.
dc.contributor.advisor	Arslan, Levent M.
dc.contributor.author	Dutağacı, Helin.
dc.date.accessioned	2023-03-16T10:16:43Z
dc.date.available	2023-03-16T10:16:43Z
dc.date.issued	2002.
dc.identifier.other	EE 2002 D88
dc.identifier.uri	http://digitalarchive.boun.edu.tr/handle/123456789/12644
dc.description.abstract	In this thesis we compare four statistical language models for large vocabulary Turkish speech recognition. Turkish is an agglutinative language with productive morphotactics. This property leads to vocabulary explosion and misestimation of N-gram probabilities when designing speech recognition systems. The solution is to parse words into smaller base units that can cover the language with a relatively small vocabulary. Three ways of decomposing words into base units are described: a morpheme-based model, a stem-ending-based model and a syllable-based model. These models, together with the word-based model, are compared with respect to vocabulary size, text coverage, bigram perplexity and speech recognition performance. We have constructed a Turkish text corpus of 10 million words, containing various texts collected from the Web. These texts have been parsed into their morphemes, stems, endings and syllables, and statistics of these base units have been estimated. Finally, we have performed speech recognition experiments with models constructed from these base units.
dc.format.extent	30 cm.
dc.publisher	Thesis (M.S.) - Bogazici University. Institute for Graduate Studies in Science and Engineering, 2002.
dc.relation	Includes appendices.
dc.subject.lcsh	Automatic speech recognition -- Statistical methods.
dc.subject.lcsh	Turkish language -- Morphology.
dc.subject.lcsh	Turkish language -- Word formation.
dc.subject.lcsh	Turkish language -- Data processing.
dc.title	Statistical language models for large vocabulary Turkish speech recognition
dc.format.pages	xv, 89 leaves