
Bootstrapping a speech recognition system by using video text recognition


dc.contributor Graduate Program in Electrical and Electronic Engineering.
dc.contributor.advisor Saraçlar, Murat.
dc.contributor.author Som, Temuçin.
dc.date.accessioned 2023-03-16T10:17:19Z
dc.date.available 2023-03-16T10:17:19Z
dc.date.issued 2009.
dc.identifier.other EE 2009 S66
dc.identifier.uri http://digitalarchive.boun.edu.tr/handle/123456789/12743
dc.description.abstract In broadcast news for the hearing impaired, information is conveyed through three modalities: speech, sign language, and sliding video text. In this work, we propose an HMM-based sliding video text recognition (SVTR) system to generate automatic transcriptions of the speech in broadcast news for the hearing impaired. We then bootstrap an unsupervised acoustic model using those automatic transcriptions. The sliding video text recognition system is trained on a minimal amount of video data (7 minutes). Well-known speech processing techniques are applied to model and recognize the sliding video text. The baseline system gives a 2.2% word error rate on the video test set. A character error analysis is then provided, and a character-based language model is employed to correct the errors. Finally, a semi-supervised training method is applied and a significant error reduction is achieved (2.2% → 0.9%). An automatic speech recognition system is bootstrapped by using the output of the sliding video text recognizer as the transcriptions. The speech data is segmented automatically and aligned with the automatic transcriptions. An unsupervised acoustic model (U-AM) is trained on 83 videos (11 hours). A 12.7% word error rate is achieved for the U-AM with a 200K language model. The out-of-vocabulary (OOV) rates of the language models are decreased by adding the automatic transcriptions of the audio training set to the large text corpus, and the effect of the OOV rate on system performance is investigated. Finally, we compare the U-AM performance with that of a supervised model built from the same acoustic training corpus with manual transcriptions. The supervised acoustic model performs only 0.4% better than the U-AM (12.7% → 12.3%). (An illustrative OOV-rate sketch follows this record.)
dc.format.extent 30cm.
dc.publisher Thesis (M.S.)-Bogazici University. Institute for Graduate Studies in Science and Engineering, 2009.
dc.relation Includes appendices.
dc.subject.lcsh Automatic speech recognition.
dc.title Bootstrapping a speech recognition system by using video text recognition
dc.format.pages xiv, 70 leaves;
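
The abstract reports that adding the automatic transcriptions of the audio training set to the text corpus lowers the language models' OOV rates. Below is a minimal Python sketch of how such an OOV rate could be measured against a language-model vocabulary. The file names (lm_200k.vocab, auto_transcripts.txt) and plain whitespace tokenization are illustrative assumptions, not details taken from the thesis.

    # Sketch: measure the out-of-vocabulary (OOV) rate of a set of
    # transcriptions relative to a language-model vocabulary.
    # All file names here are hypothetical placeholders.

    def load_vocabulary(path):
        """Read one vocabulary word per line into a set."""
        with open(path, encoding="utf-8") as f:
            return {line.strip() for line in f if line.strip()}

    def oov_rate(transcription_paths, vocab):
        """Fraction of running words not covered by the vocabulary."""
        total, oov = 0, 0
        for path in transcription_paths:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    for word in line.split():
                        total += 1
                        if word not in vocab:
                            oov += 1
        return oov / total if total else 0.0

    if __name__ == "__main__":
        vocab = load_vocabulary("lm_200k.vocab")          # hypothetical 200K-word LM vocabulary
        rate = oov_rate(["auto_transcripts.txt"], vocab)  # hypothetical transcription file
        print(f"OOV rate: {rate:.2%}")

Under this setup, rerunning the measurement after appending the automatic transcriptions to the vocabulary-building corpus would show the OOV reduction the abstract describes.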

