
Bootstrapping a speech recognition system by using video text recognition


dc.contributor Graduate Program in Electrical and Electronic Engineering.
dc.contributor.advisor Saraçlar, Murat.
dc.contributor.author Som, Temuçin.
dc.date.accessioned 2023-03-16T10:17:19Z
dc.date.available 2023-03-16T10:17:19Z
dc.date.issued 2009.
dc.identifier.other EE 2009 S66
dc.identifier.uri http://digitalarchive.boun.edu.tr/handle/123456789/12743
dc.description.abstract In broadcast news for the hearing impaired, information is conveyed through three modalities: speech, sign language, and sliding video text. In this work, we propose an HMM-based sliding video text recognition (SVTR) system to generate automatic transcriptions of the speech in broadcast news for the hearing impaired. We then bootstrap an unsupervised acoustic model using those automatic transcriptions. The sliding video text recognition system is trained on a minimal amount of video data (7 minutes). Well-known speech processing techniques are applied to model and recognize the sliding video text. The baseline system gives a 2.2% word error rate on the video test set. A character error analysis is then provided, and a character-based language model is employed to correct the errors. Finally, a semi-supervised training method is applied and a significant error reduction is achieved (2.2% → 0.9%). An automatic speech recognition system is bootstrapped by using the output of the sliding video text recognizer as the transcriptions. The speech data is segmented automatically and aligned with the automatic transcriptions. An unsupervised acoustic model (U-AM) is trained on 83 videos (11 hours). A 12.7% word error rate is achieved for the U-AM with a 200K language model. The out-of-vocabulary (OOV) rates of the language models are decreased by adding the automatic transcriptions of the audio training set to the large text corpus, and the effect of the OOV rate on system performance is investigated. Finally, we compare the U-AM performance with that of a supervised model built from the same acoustic training corpus with manual transcriptions. The supervised acoustic model performs only 0.4% better than the U-AM (12.7% → 12.3%). (An illustrative OOV-rate sketch follows this record.)
dc.format.extent 30cm.
dc.publisher Thesis (M.S.)-Bogazici University. Institute for Graduate Studies in Science and Engineering, 2009.
dc.relation Includes appendices.
dc.subject.lcsh Automatic speech recognition.
dc.title Bootstrapping a speech recognition system by using video text recognition
dc.format.pages xiv, 70 leaves;
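
The abstract reports that adding the automatic transcriptions of the audio training set to the text corpus lowers the language models' OOV rates. Below is a minimal Python sketch of how such an OOV rate could be measured against a language-model vocabulary. The file names (lm_200k.vocab, auto_transcripts.txt) and plain whitespace tokenization are illustrative assumptions, not details taken from the thesis.

    # Sketch: measure the out-of-vocabulary (OOV) rate of a set of
    # transcriptions relative to a language-model vocabulary.
    # All file names here are hypothetical placeholders.

    def load_vocabulary(path):
        """Read one vocabulary word per line into a set."""
        with open(path, encoding="utf-8") as f:
            return {line.strip() for line in f if line.strip()}

    def oov_rate(transcription_paths, vocab):
        """Fraction of running words not covered by the vocabulary."""
        total, oov = 0, 0
        for path in transcription_paths:
            with open(path, encoding="utf-8") as f:
                for line in f:
                    for word in line.split():
                        total += 1
                        if word not in vocab:
                            oov += 1
        return oov / total if total else 0.0

    if __name__ == "__main__":
        vocab = load_vocabulary("lm_200k.vocab")          # hypothetical 200K-word LM vocabulary
        rate = oov_rate(["auto_transcripts.txt"], vocab)  # hypothetical transcription file
        print(f"OOV rate: {rate:.2%}")

Under this setup, rerunning the measurement after appending the automatic transcriptions to the vocabulary-building corpus would show the OOV reduction the abstract describes.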

