Abstract:
Video emotion estimation has been an active research area for decades. It is crucial for personalized content delivery, video recommendation, video summarization, and censorship of inappropriate scenes. In this thesis, we study movie emotion estimation. The problem is explored with regard to feature extraction, feature summarization, feature selection, regression, and data synthesis. The audio and image features extracted from the videos are Mel-frequency cepstral coefficients, hue-saturation histograms, dense scale-invariant feature transform descriptors, facial action units, and features from the sixth fully connected layer of the VGG network. The features are summarized via descriptive-statistics functionals and Fisher vector encoding. A feature selection technique based on canonical correlation analysis is applied to the features. Extreme learning machines and support vector machines are used as regression techniques. We construct the training and validation sets by examining the distributions of movie scenes and movie moods in the dataset, and we synthesize data for the minority classes of the unbalanced dataset. Feature-level and score-level fusion techniques are applied to the best-performing features, and smoothing techniques are used to handle sudden changes between consecutive labels. Our approach is evaluated on the dataset of the Emotional Impact of Movies Task provided by the MediaEval 2017 organization; the movies in the dataset are drawn from the challenging LIRIS-ACCEDE dataset. The fusion of facial action units and hue-saturation histogram features yields the best arousal results, while the score-level fusion of the VGG sixth fully connected layer features and hue-saturation histograms achieves the best valence results. The best valence model attains a good Pearson correlation coefficient, and in the best arousal model several slopes of the predictions follow the slopes of the ground-truth label curve.