Abstract:
Speaker verification is the process of confirming a claimed identity from a voice sample. A speaker verification mechanism can be deployed in a banking system, a call center, forensic software, or a mobile application for accessing sensitive data. Text-independent speaker verification addresses the case where the spoken content is unrestricted; variation in phonetic content and speech duration is a key challenge in this setting. In this study, we propose a method for the text-independent speaker verification task. In the proposed method, the phonemes of two utterances are recognized together with their boundaries. Acoustic spectral features are extracted and fused with the recognized phonetic information. Each phoneme is paired with the matching phoneme from the other utterance, and the fused features of each pair are concatenated. This representation is fed to a fully connected feedforward neural network that makes the same/different-speaker decision. Public English speaker verification datasets are used for training and testing. The effects of various feature, data, and test conditions are investigated. Equal error rates ranging from 16.7% to 23.7% are observed as speech duration varies from 15 seconds down to 1 second. When the whole recording is used, an equal error rate of 15.5% is observed.
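The pairing-and-scoring pipeline described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the feature dimension, network size, random weights, pooling by averaging, and all function names are assumptions introduced for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT_DIM = 16  # assumed fused (spectral + phonetic) feature size per phoneme


def pair_features(utt_a, utt_b):
    """Pair matching phonemes of two utterances and concatenate features.

    utt_a, utt_b: dicts mapping phoneme label -> fused feature vector.
    Only phonemes present in both utterances form a pair.
    """
    pairs = []
    for phone in sorted(utt_a.keys() & utt_b.keys()):
        pairs.append(np.concatenate([utt_a[phone], utt_b[phone]]))
    return np.array(pairs)


class FeedForwardScorer:
    """Tiny fully connected feedforward net: one hidden layer, sigmoid out.

    Weights are random here; in the paper's setting they would be trained
    on same/different-speaker pairs.
    """

    def __init__(self, in_dim, hidden=8):
        self.w1 = rng.normal(0, 0.1, (in_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0, 0.1, (hidden, 1))
        self.b2 = np.zeros(1)

    def score(self, x):
        h = np.tanh(x @ self.w1 + self.b1)
        logit = h @ self.w2 + self.b2
        return 1.0 / (1.0 + np.exp(-logit))  # same-speaker probability


# Toy utterances: phoneme label -> fused feature vector (random stand-ins).
utt1 = {p: rng.normal(size=FEAT_DIM) for p in ("AA", "IY", "S")}
utt2 = {p: rng.normal(size=FEAT_DIM) for p in ("AA", "S", "T")}

pairs = pair_features(utt1, utt2)   # shape: (num_shared_phonemes, 2*FEAT_DIM)
net = FeedForwardScorer(2 * FEAT_DIM)
scores = net.score(pairs)           # one score per phoneme pair
decision = float(scores.mean())     # pooled same/different-speaker score
```

Averaging the per-pair scores into one decision is one plausible pooling choice; the abstract does not specify how pairwise outputs are combined.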