Abstract:
Automatic analysis of human behavior has long been a difficult problem due to noise, environmental variation, and the scarcity of annotated data. While lab-controlled data provides an easier learning setting, “in the wild” datasets require systems complex enough to model unseen data while, at the same time, avoiding overfitting. In this thesis, we propose a fast and robust multimodal system that analyzes humans from facial images, videos, and voice. We extract dense appearance descriptors as well as Deep Convolutional Neural Network (DCNN) features from faces, and we train kernel Extreme Learning Machine (ELM) classifiers, which are then combined by various fusion schemes. We apply our pipeline to a number of affective and biometric challenges and show that the ELM provides fast and accurate learning compared to traditional learning methods. We also show that multimodal fusion and DCNN fine-tuning improve accuracy in almost all tasks. Our method ranked 2nd in the Emotion Recognition in the Wild (EmotiW) challenge and 1st in both the second round of the ChaLearn Apparent Personality Analysis from First Impressions (FI) challenge and the ChaLearn Job Candidate Screening (JCS) challenge. Our results show that kernel ELMs make learning efficient in both training time and computational complexity while preserving high performance.
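For orientation, the following is a minimal sketch of the kernel ELM classification step the abstract refers to: a single closed-form linear solve, which is what makes training fast compared to iterative methods. It assumes an RBF kernel, integer class labels encoded as one-hot targets, and a regularization parameter C; all names here (rbf_kernel, KernelELM) are illustrative, not taken from the thesis code, and feature extraction and fusion are omitted.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.1):
    """RBF kernel matrix between row-vector feature sets A and B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

class KernelELM:
    """Kernel Extreme Learning Machine: the output weights are obtained
    in closed form, beta = (I/C + K)^-1 T, so there is no iterative training."""
    def __init__(self, C=1.0, gamma=0.1):
        self.C, self.gamma = C, gamma

    def fit(self, X, y):
        self.X = X
        T = np.eye(int(y.max()) + 1)[y]            # one-hot targets
        K = rbf_kernel(X, X, self.gamma)           # training kernel matrix
        n = X.shape[0]
        self.beta = np.linalg.solve(np.eye(n) / self.C + K, T)
        return self

    def predict(self, X_new):
        # Score new samples against the training set and take the arg-max class.
        return np.argmax(rbf_kernel(X_new, self.X, self.gamma) @ self.beta, axis=1)

# Illustrative usage on precomputed face features (names are hypothetical):
#   clf = KernelELM(C=10.0, gamma=0.01).fit(train_feats, train_labels)
#   preds = clf.predict(test_feats)
```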