Abstract:
A great deal of research in the computer vision community has gone into action and event recognition. Automatic understanding of actions in video is crucial for application areas such as video indexing, surveillance and video summarization. In this thesis, we explore action and event recognition on RGB videos, both in terms of feature extraction and classification. We propose a novel approach for large-scale action recognition in a realistic setting. After reviewing the technical background on recent popular video description methods, we present our approach, in which improved dense trajectory features combined with Fisher vector encoding are fed to an extreme learning machine classifier. We show that the extreme learning machine provides a fast and accurate alternative to traditional classifiers such as support vector machines. Additionally, we investigate the usability of mid-level features that we introduce to encode information about human part regions. We extensively study each step of our pipeline in a comparative manner. We evaluate our approach on recently published benchmarks that were introduced as challenge datasets: UCF101, THUMOS 2014 and ChaLearn Looking at People 2014 Track 2. Videos in the first dataset contain cropped actions, while those in the last two datasets are temporally untrimmed and therefore more challenging. On the 102 action classes of the THUMOS 2014 dataset, we achieve 63.37% mean average precision using the challenge protocol, ranking 3rd among the participants. Our results show that the extreme learning machine enables efficient learning, in terms of both time and computational complexity, while preserving high performance.