dc.contributor |
Graduate Program in Computer Engineering. |
|
dc.contributor.advisor |
Akarun, Lale. |
|
dc.contributor.author |
Çelimli, Ahmet Faruk. |
|
dc.date.accessioned |
2023-10-15T06:54:29Z |
|
dc.date.available |
2023-10-15T06:54:29Z |
|
dc.date.issued |
2022 |
|
dc.identifier.other |
CMPE 2022 C46 |
|
dc.identifier.uri |
http://digitalarchive.boun.edu.tr/handle/123456789/19712 |
|
dc.description.abstract |
Sign languages (SLs) are the primary means of communication for deaf people. They are visual languages that convey meaning through multiple cues, including hand gestures, upper-body movements, and facial expressions. Sign language recognition (SLR) models have the potential to ease communication between hearing and deaf people. Advancements in deep learning and the increased availability of public datasets have led more researchers to study SLR, shifting solution methods from hand-crafted features to two-dimensional convolutional neural network (2D CNN) models. The inadequacy of 2D CNNs at temporal modeling and the spatio-temporal modeling ability of 3D CNNs made 3D CNNs a popular choice. Despite their successful results, the high computational cost and memory requirements of 3D CNNs created a need for alternative architectures. In this thesis, we propose an SLR model that uses a 2D CNN backbone together with attention modeling and temporal shift. Using a 2D CNN decreases the number of parameters and the required memory compared to a 3D CNN counterpart. To increase adaptability to other datasets and simplify the training process, our model uses full-frame RGB images instead of cropped images that focus on specific body parts of the signers. Since communication in SL relies on multiple visual cues used simultaneously or at different moments, the model must learn how these cues interact. While temporal shift modules give our 2D CNN backbone the ability to model temporal information, attention modules learn what, where, and when to focus on in videos. We evaluated our model on BosphorusSign22k, a Turkish isolated SLR dataset, where the proposed model achieves 92.97% classification accuracy. Our study shows that attention modeling with temporal shift on top of a 2D CNN backbone gives competitive results in isolated SLR. |
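The temporal shift idea mentioned in the abstract can be illustrated with a short sketch. This is not the thesis code; it is a minimal PyTorch example under the assumption that a 2D CNN backbone produces features of shape (N*T, C, H, W) for clips of T frames, and that a fraction of the channels is shifted backward and forward along the time axis so the 2D backbone gains temporal modeling at no extra parameter cost.

import torch

def temporal_shift(x, n_segments, shift_div=8):
    # x: (N*T, C, H, W) features from a 2D CNN backbone; n_segments = T frames per clip.
    nt, c, h, w = x.size()
    n = nt // n_segments
    x = x.view(n, n_segments, c, h, w)
    fold = c // shift_div
    out = torch.zeros_like(x)
    out[:, :-1, :fold] = x[:, 1:, :fold]                  # shift the first channel slice backward in time
    out[:, 1:, fold:2 * fold] = x[:, :-1, fold:2 * fold]  # shift the next slice forward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]             # remaining channels stay in place
    return out.view(nt, c, h, w)

# Hypothetical usage: 2 clips of 8 frames with 64-channel 56x56 feature maps.
feat = torch.randn(2 * 8, 64, 56, 56)
shifted = temporal_shift(feat, n_segments=8)
print(shifted.shape)  # torch.Size([16, 64, 56, 56])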
|
dc.publisher |
Thesis (M.S.) - Bogazici University. Institute for Graduate Studies in Science and Engineering, 2022. |
|
dc.subject.lcsh |
Sign language. |
|
dc.subject.lcsh |
Neural networks (Computer science) |
|
dc.subject.lcsh |
Deep learning (Machine learning) |
|
dc.title |
Attention modeling with temporal shift in sign language recognition |
|
dc.format.pages |
xv, 59 leaves |
|