Multimodal representation learning for synchronized speech and videos

dc.contributor Graduate Program in Electrical and Electronic Engineering.
dc.contributor.advisor Saraçlar, Murat.
dc.contributor.author Köse, Öykü Deniz.
dc.date.accessioned 2023-03-16T10:20:43Z
dc.date.available 2023-03-16T10:20:43Z
dc.date.issued 2020.
dc.identifier.other EE 2020 K78
dc.identifier.uri http://digitalarchive.boun.edu.tr/handle/123456789/12986
dc.description.abstract The amount of multimedia data has increased rapidly in recent years. While this data growth enables multimodal neural network based studies, it has also created a need for efficient storage and retrieval systems for multimodal data. In this thesis, different data fusion schemes are examined to assess the benefits of using different data sources. The proposed fusion schemes differ in the stage at which data fusion is performed. Additionally, several representation learning methods are investigated for efficient data storage and retrieval systems. Representations are generated in such a way that they reflect the distance between the represented data segments according to a certain distance metric. A joint representation and distance metric learning scheme is also considered for a performance gain. Several deep neural network models are designed for representation learning and data fusion, and their performances are evaluated on the same-different word discrimination and phone classification tasks, respectively. Experiments are performed on two different multimodal data sets: USC-TIMIT rtMRI and signed Turkish broadcast news. The outcomes of the experiments show that data fusion indeed brings a performance improvement over unimodal approaches, and that performing fusion in earlier stages yields better results than fusing the data in later stages. Additionally, the proposed methods for representation learning outperform the corresponding baseline systems in the same-different word discrimination task. Therefore, the generated representations of video and audio segments can be considered an important step towards a fast cross-modal query-by-sign search system.
dc.format.extent 30 cm.
dc.publisher Thesis (M.S.) - Bogazici University. Institute for Graduate Studies in Science and Engineering, 2020.
dc.subject.lcsh Multimodal user interfaces (Computer systems)
dc.subject.lcsh Multimedia communications.
dc.title Multimodal representation learning for synchronized speech and videos
dc.format.pages xiv, 62 leaves ;