
Indexation, retrieval and decision techniques for spoken term detection


dc.contributor Graduate Program in Electrical and Electronic Engineering.
dc.contributor.advisor Saraçlar, Murat.
dc.contributor.author Can, Doğan.
dc.date.accessioned 2023-03-16T10:17:24Z
dc.date.available 2023-03-16T10:17:24Z
dc.date.issued 2010.
dc.identifier.other EE 2010 C36
dc.identifier.uri http://digitalarchive.boun.edu.tr/handle/123456789/12755
dc.description.abstract Speech Retrieval (SR) systems aim to provide access to large multimedia archives that include a vast amount of spoken media such as lecture videos, podcasts, news clips and audio books. To that end, SR integrates two well-studied fields: Automatic Speech Recognition (ASR) and Information Retrieval (IR). In an ideal setup where ASR transcripts are on a par with manual transcripts, SR is nothing more than classical text retrieval applied to ASR output. However, ASR technology is far from that point when it comes to heterogeneous stacks of unconstrained, unorganized audio recorded in uncontrolled environments. Considering the domain of interest to the end user (think of databases like "YouTube"), it becomes immediately obvious that relying entirely on ASR transcripts is not an option for SR. To minimize the effect of recognition errors, most SR systems are built upon ASR lattices, where the oracle word error rates are much lower than those of the single best hypotheses. In these systems, it is possible to retrieve overlapping hits for different queries since the index takes many alternative transcriptions into consideration for each spoken segment in the database. As a result, it becomes possible to retrieve matches that are omitted from the best hypotheses. However, this approach alone does not meet the open-vocabulary search objective of most SR systems, since retrieval is still limited to the ASR vocabulary. Utilizing sub-word (phone, graphone, morpheme) transcripts, or sub-word lattices for that matter, projects the word-level index/search/decide problem onto a finer-grained space where sub-word strings become the object of search. In this sub-word space, retrieval is partly freed from the constraints of the system vocabulary, and out-of-vocabulary (OOV) query terms can be retrieved simply by searching the sub-word level ASR outputs. Lattice indexing and sub-word methods improve recall, but they also stress the ranking/decision process by matching segments irrelevant to the query. As the decision threshold is lowered to retrieve more, a large number of false alarms come into play as a combined effect of lattices and sub-words. It is therefore increasingly important to develop effective decision strategies that provide better discrimination between actual hits and false alarms. Spoken Term Detection (STD) is a relatively new SR task which aims to locate exact matches to a given query term (a sequence of words in text form) in a large spoken database. In this thesis, we look for high-performing, low-cost, efficient and reliable solutions to the various challenges of the STD task. Our methods include novel techniques for indexing ASR lattices, retrieving OOV words and ranking/thresholding candidate results in a general, efficient and mathematically sound retrieval framework.
dc.format.extent 30 cm.
dc.publisher Thesis (M.S.)-Bogazici University. Institute for Graduate Studies in Science and Engineering, 2010.
dc.subject.lcsh Automatic speech recognition.
dc.subject.lcsh Information retrieval.
dc.title Indexation, retrieval and decision techniques for spoken term detection
dc.format.pages xix, 87 leaves;
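
The abstract above describes two ideas that a small example can make concrete: indexing several weighted lattice hypotheses per spoken segment instead of only the best one, and searching at the sub-word (phone) level so that out-of-vocabulary queries can still be matched, with a score threshold trading recall against false alarms. The Python sketch below is a toy illustration of that retrieval loop under strong simplifying assumptions: a flat list of posterior-weighted phone sequences stands in for a real ASR lattice, and every utterance id, pronunciation and score is invented for the example. It is not the indexing framework developed in the thesis.

    from collections import defaultdict

    # Hypothetical lattice output: utterance id -> list of (phone sequence, posterior score).
    # These hypotheses play the role of alternative paths through an ASR lattice.
    lattice_hyps = {
        "utt1": [(("b", "o", "g", "a", "z"), 0.6), (("b", "o", "g", "u", "s"), 0.3)],
        "utt2": [(("s", "p", "iy", "ch"), 0.8)],
    }

    def build_index(hyps):
        """Index every phone occurrence so exact sub-word matches can be located later."""
        index = defaultdict(list)  # phone -> list of (utterance, hypothesis id, position, posterior)
        for utt, entries in hyps.items():
            for hyp_id, (phones, posterior) in enumerate(entries):
                for pos, phone in enumerate(phones):
                    index[phone].append((utt, hyp_id, pos, posterior))
        return index

    def search(index, hyps, query_phones, threshold):
        """Return (utterance, position, score) hits whose hypothesis contains the
        query phone string with a posterior above the decision threshold."""
        hits = []
        for utt, hyp_id, pos, posterior in index.get(query_phones[0], []):
            phones = hyps[utt][hyp_id][0]
            if phones[pos:pos + len(query_phones)] == tuple(query_phones):
                # Lowering the threshold retrieves more candidates but also admits
                # more false alarms, the trade-off discussed in the abstract.
                if posterior >= threshold:
                    hits.append((utt, pos, posterior))
        return hits

    if __name__ == "__main__":
        index = build_index(lattice_hyps)
        # An OOV query would first be mapped to phones (e.g. by a grapheme-to-phoneme
        # model) and then searched directly in the sub-word index.
        print(search(index, lattice_hyps, ["g", "a", "z"], threshold=0.2))

In practice, lattice-based indexes of the kind the abstract refers to are usually built over weighted automata rather than enumerated hypothesis lists, since a lattice can encode exponentially many alternative transcriptions for a segment.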

