Multimodal representation learning for synchronized speech and videos

dc.contributor Graduate Program in Electrical and Electronic Engineering.
dc.contributor.advisor Saraçlar, Murat.
dc.contributor.author Köse, Öykü Deniz.
dc.date.accessioned 2023-03-16T10:20:43Z
dc.date.available 2023-03-16T10:20:43Z
dc.date.issued 2020.
dc.identifier.other EE 2020 K78
dc.identifier.uri http://digitalarchive.boun.edu.tr/handle/123456789/12986
dc.description.abstract The amount of multimedia data has increased rapidly in recent years. While this data growth enables multimodal neural network based studies, it has also created a need for efficient storage and retrieval systems for multimodal data. In this thesis, different data fusion schemes are examined to assess the benefits of using different data sources. The proposed fusion schemes differ in the stage at which data fusion is performed. Additionally, several representation learning methods are investigated for efficient data storage and retrieval systems. Representations are generated in such a way that they reflect the distance between the represented data segments according to a certain distance metric. A joint representation and distance metric learning scheme is also considered for a performance gain. Several deep neural network models are designed for representation learning and data fusion, and their performances are evaluated on the same-different word discrimination and phone classification tasks, respectively. Experiments are performed on two different multimodal data sets: USC-TIMIT rtMRI and signed Turkish broadcast news. The outcomes of the experiments show that data fusion indeed brings a performance improvement over unimodal approaches, and that performing fusion in earlier stages yields better results than fusing the data in later stages. Additionally, the proposed methods for representation learning outperform the corresponding baseline systems in the same-different word discrimination task. Therefore, the generated representations of video and audio segments can be considered an important step towards a fast cross-modal query-by-sign search system.
dc.format.extent 30 cm.
dc.publisher Thesis (M.S.) - Bogazici University. Institute for Graduate Studies in Science and Engineering, 2020.
dc.subject.lcsh Multimodal user interfaces (Computer systems)
dc.subject.lcsh Multimedia communications.
dc.title Multimodal representation learning for synchronized speech and videos
dc.format.pages xiv, 62 leaves ;