Abstract:
Verbal multiword expression (VMWE) identi cation is a challenging task for many natural language processing studies. In this study, sequence tagging approach accompanied with stochastic models and variants of IOB tagging scheme is used for VMWE identi cation. In the scope of this thesis, a VMWE annotated Turkish corpus is constructed as the rst part of the PARSEME shared task 1.1 which is constructing VMWE annotated corpora in many languages. Additionally, a multilingual system called Deep-BGT is developed as the second part of the shared task which is developing language-independent VMWE identi cation systems using the corpora constructed in the rst part. The Turkish corpus is one of the biggest corpora in the shared task. The training and test corpora that were published in the PARSEME shared task 1.0 are updated as the PARSEME shared task 1.1 training and development corpora according to the new guidelines. A new test corpus is constructed from scratch. Deep-BGT uses the bidirectional Long Short-Term Memory model with a Conditional Random Field layer on top (BiLSTM-CRF). To the best of our knowledge, this study is the rst one that employs the BiLSTM-CRF model for VMWE identi cation. Deep-BGT was ranked the second in the open track in terms of the general ranking metric. Moreover, a novel tagging scheme called bigappy-unicrossy is introduced to rise to the challenge of overlapping VMWEs. Finally, the VMWE identi cation system is advanced by evaluating a subset of hyperparameters which consists of tagging scheme, number of units, number of BiLSTM layers, and classi er. A comprehensive analysis of BiLSTM based architectures for multilingual identi cation of VMWEs is presented accordingly.