dc.description.abstract |
Dependency parsing is an important step for many natural language processing (NLP) systems such as question answering and machine translation. Turkish, being a morphologically rich language and having a complex grammar, is challenging for automatic processing. Limited NLP tools and resources for Turkish make the task even more challenging. Data-driven deep learning models show promising performance in dependency parsing. Yet, the amount of data to train a data-driven dependency parser directly affects performance, and deep learning-based systems require extensive data to achieve good performance. In this thesis, we focused on Turkish dependency parsing and proposed two solutions to the challenges this task poses. First, we increased the size and quality of labeled data for Turkish dependency parsing. In this respect, we created the BOUN Treebank by annotating 9,761 sentences. In addition, we re-annotated the IMST and PUD treebanks using the same annotation scheme. As a result, we presented the largest collection of Turkish treebanks with consistent annotation. Second, we developed novel state-of-the-art dependency parsing models for Turkish as well as other low-resource languages. As our first parsing approach, we introduced a hybrid dependency parser where Turkish grammar rules and morphological features of words are integrated into the deep learning model. Despite the limited training data, the hybrid parser achieved higher success than the current methods for Turkish dependency parsing. As our second parsing approach, we proposed a deep dependency parser with semi-supervised enhancement. By conducting experiments on a number of low-resource languages besides Turkish, we achieved state-of-the-art results on all datasets. We have shown that deep learning-based models can be improved not only by additional training data, but also by integrating intelligently extracted information. |
|