dc.contributor |
Graduate Program in Electrical and Electronic Engineering. |
|
dc.contributor.advisor |
Saraçlar, Murat. |
|
dc.contributor.author |
Köse, Öykü Deniz. |
|
dc.date.accessioned |
2023-03-16T10:20:43Z |
|
dc.date.available |
2023-03-16T10:20:43Z |
|
dc.date.issued |
2020. |
|
dc.identifier.other |
EE 2020 K78 |
|
dc.identifier.uri |
http://digitalarchive.boun.edu.tr/handle/123456789/12986 |
|
dc.description.abstract |
The amount of multimedia data has increased rapidly in recent years. While this data growth enables multimodal neural network based studies, it has also resulted in a need for efficient storage and retrieval systems for multimodal data. In this thesis, different data fusion schemes are examined to see the benefits of using different data sources. The proposed fusion schemes differ in the stage at which the data fusion is performed. Additionally, several representation learning methods are investigated for efficient data storage and retrieval systems. Representations are generated in such a way that they reflect the distance between the represented data segments according to a certain distance metric. A joint representation and distance metric learning scheme is also considered for a performance gain. Several deep neural network models are designed for representation learning and data fusion, and their performances are evaluated with the same-different word-discrimination and phone classification tasks, respectively. Experiments are performed on two different multimodal data sets: USC-TIMIT rtMRI and Signed Turkish broadcast news. Outcomes of the experiments show that data fusion indeed brings a performance improvement over unimodal approaches, and that performing fusion in earlier stages yields better results than fusing the data in later stages. Additionally, the proposed methods for representation learning outperform the corresponding baseline systems in the same-different word-discrimination task. Therefore, the generated representations of video and audio segments can be considered an important step towards a fast cross-modal query-by-sign search system. |
|
dc.format.extent |
30 cm. |
|
dc.publisher |
Thesis (M.S.) - Bogazici University. Institute for Graduate Studies in Science and Engineering, 2020. |
|
dc.subject.lcsh |
Multimodal user interfaces (Computer systems) |
|
dc.subject.lcsh |
Multimedia communications. |
|
dc.title |
Multimodal representation learning for synchronized speech and videos |
|
dc.format.pages |
xiv, 62 leaves ; |
|