Using crosslingual information for keyword search in low resource languages

Yusuf, Bolaji.

dc.contributor	Graduate Program in Electrical and Electronic Engineering.
dc.contributor.advisor	Saraçlar, Murat.
dc.contributor.author	Yusuf, Bolaji.
dc.date.accessioned	2023-03-16T10:19:30Z
dc.date.available	2023-03-16T10:19:30Z
dc.date.issued	2018.
dc.identifier.other	EE 2018 Y87
dc.identifier.uri	http://digitalarchive.boun.edu.tr/handle/123456789/12933
dc.description.abstract	Keyword search (KWS) is a subtask of spoken content retrieval that aims to solve the problem of locating a written query within a large, unlabeled spoken document. The dominant approach to KWS involves transcribing the document using an automatic speech recognition (ASR) system and conducting the search on indexes obtained from the ASR lattices. The large vocabulary continuous speech recognition (LVCSR) systems used to decode the document typically require enormous amounts of labeled data to achieve good recognition and, consequently, search accuracy. Therefore, KWS models built for languages with relatively little labeled training data need to contend with the deterioration in search performance that accompanies a decline in ASR performance. This deterioration is exacerbated by the increased incidence of search terms that are out of vocabulary (OOV) with respect to the training data. One way of improving KWS performance in such a setting is to leverage information from other languages. In this work, we use a multilingual representation to build a vocabulary-agnostic KWS model. The multilingual bottleneck (BN) representation, obtained from a neural network trained on the source languages, is used to train a metric-learning-based KWS engine in the target languages. Experiments on the low resource datasets from the IARPA Babel Program show the benefits of using the proposed system as an alternative to, or in tandem with, more traditional multilingual models. In an extremely low resource setting, the performance of the proposed system exceeds that of the baseline system (also trained with multilingual data). Furthermore, in a milder low resource setting, the proposed system performs better on OOV term retrieval than the baseline. In either setting, we show that combining the results from both systems yields robustness against OOV terms and better overall performance.
dc.format.extent	30 cm.
dc.publisher	Thesis (M.S.) - Bogazici University. Institute for Graduate Studies in Science and Engineering, 2018.
dc.subject.lcsh	Keyword searching.
dc.title	Using crosslingual information for keyword search in low resource languages
dc.format.pages	xiv, 67 leaves ;