dc.description.abstract |
Speech retrieval is an emerging field of information retrieval in which the information is spoken rather than written. So far, spoken information retrieval has been studied in several languages. In this thesis, we concentrate on the retrieval of Turkish Broadcast News. We implement two tasks: Spoken Term Detection (STD) and Spoken Document Retrieval (SDR). Although both combine Automatic Speech Recognition (ASR) and Information Retrieval (IR) techniques to retrieve spoken data, their main goals differ. STD retrieves specific occurrences and requires an exact match, while SDR retrieves related documents and places more weight on context. Automatic transcription and retrieval of speech is more complicated in agglutinative languages because a standard-size recognition vocabulary can cover only a limited portion of the language. A common solution is to segment words into subwords and use subword units in recognition. We employed grammatical and statistical subword units in recognition and indexing for STD. The best scores were obtained by combining word-based and statistical subword-based approaches. Word segmentation algorithms are also useful in SDR, since stems bear the meaning and provide a better representation of context. Experiments showed that stemming improves SDR performance, but the segmentation methods did not differ significantly from one another. We also studied language-independent ASR errors. Indexing the alternative ASR hypotheses, as well as the best one, was shown to be effective on the STD task. Results are presented on our Turkish Broadcast News Corpus. |
|