Abstract:
Text-to-speech (TTS) systems have been an assistive technology since the 1970s. Although commercial use began decades ago, the quality of synthetic speech still falls short of recorded speech. This study focuses on one particular problem in the field: speaker adaptation in TTS systems. Speaker adaptation is the task of modifying a given TTS model so that the modified model synthesizes speech with the voice characteristics of a desired speaker. We present novel deep neural network (DNN) based speaker adaptation techniques that incorporate transfer learning methods. Using clustering methods, we replace high-dimensional speaker embeddings with low-dimensional vectors. Objective results indicate a significant improvement in adaptation performance over baseline techniques, together with a significant drop in the number of parameters. The second aspect of this study is speaker adaptation applied to DNN-based postfiltering methods. Subjective results show that adapting the postfiltering increases the similarity of synthetic speech to the desired speaker’s voice, although no significant improvement in quality is observed. The proposed techniques are independent of the choice of DNN architecture and speaker embedding, and can therefore be extended in the future to experiments in related fields such as speech recognition.