Abstract:
In natural language processing tasks, representing a word is an important issue. After Bengio et al. introduced a simple neural network language model that learns word vector representations in 2003, representing words in a continuous vector space became more popular. In 2013, Mikolov et al. introduced a method named word2vec and showed that word embeddings can capture meaningful syntactic and semantic similarities. Many methods and implementations have been proposed for English since then. However, there are only a few studies on word representations in Turkish. In this study, we aimed to understand and analyze how word embedding models work on both Turkish and English. We focused on the word2vec word embedding model and modified it to improve the quality of word representations. Additionally, we trained many models with different window sizes and dimensions. The impact of different configurations on the quality of word representations was analyzed both intrinsically and extrinsically. We reported accuracy on word analogy tasks for intrinsic evaluation and on word similarity tasks for extrinsic evaluation. Our results show that our proposed models perform better on most of the word analogy task categories for Turkish. We also showed that increasing window sizes and dimensions does not always improve accuracy; for some analogy and word similarity tasks, it has a negative effect.