Abstract:
Sign languages (SLs) are the primary means of communication for deaf people. They are visual languages that establish communication through multiple cues, including hand gestures, upper-body movements, and facial expressions. Sign language recognition (SLR) models have the potential to ease communication between hearing and deaf people. Advancements in deep learning and the increased availability of public datasets have led more researchers to study SLR. These advancements shifted solution methods for SLR from hand-crafted features to two-dimensional Convolutional Neural Network (2D CNN) models. The inadequacy of 2D CNNs at temporal modeling, together with the spatio-temporal modeling ability of 3D CNNs, made 3D CNNs a popular choice. Despite their successful results, the high computational cost and memory requirements of 3D CNNs created a need for alternative architectures. In this thesis, we propose an SLR model that uses a 2D CNN backbone combined with attention modeling and temporal shift. Using a 2D CNN decreases the number of parameters and the memory footprint compared to a 3D CNN counterpart. In order to increase adaptability to other datasets and simplify the training process, our model uses full-frame RGB images instead of cropped images that focus on specific body parts of the signer. Since communication in SL relies on multiple visual cues, used simultaneously or at different moments, the model must learn how these cues interact. While temporal shift modules give the 2D CNN backbone the ability to model temporal information, attention modules learn to focus on what, where, and when in videos. We evaluated our model on the BosphorusSign22k dataset, a Turkish isolated SLR dataset. The proposed model achieves 92.97% classification accuracy. Our study shows that attention modeling with temporal shift on top of a 2D CNN backbone yields competitive results in isolated SLR.
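The temporal shift operation at the heart of this design can be illustrated with a minimal PyTorch sketch, assuming the standard formulation of the Temporal Shift Module (Lin et al.); the (batch, time, channels, height, width) tensor layout and the shift_div ratio below are illustrative assumptions, not the thesis's exact implementation:

    import torch

    def temporal_shift(x: torch.Tensor, shift_div: int = 8) -> torch.Tensor:
        # x: activations shaped (batch, time, channels, height, width).
        # A fraction (1/shift_div) of channels is shifted one step forward
        # in time, another fraction one step backward; the rest stay put.
        n, t, c, h, w = x.shape
        fold = c // shift_div
        out = torch.zeros_like(x)
        out[:, 1:, :fold] = x[:, :-1, :fold]                  # shift forward in time
        out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]  # shift backward in time
        out[:, :, 2 * fold:] = x[:, :, 2 * fold:]             # unshifted channels
        return out

    # Example: 16 frames of 64-channel 28x28 feature maps for a batch of 2.
    feats = torch.randn(2, 16, 64, 28, 28)
    shifted = temporal_shift(feats)  # same shape, temporally mixed channels

Because the shift only reindexes existing activations, it adds temporal mixing to a 2D CNN at essentially zero extra parameters or FLOPs, which is what makes it an attractive alternative to 3D convolution here.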