dc.contributor |
Graduate Program in Computer Engineering. |
|
dc.contributor.advisor |
Güngör, Tunga. |
|
dc.contributor.author |
Erkaya, Erencan. |
|
dc.date.accessioned |
2023-10-15T06:58:18Z |
|
dc.date.available |
2023-10-15T06:58:18Z |
|
dc.date.issued |
2022 |
|
dc.identifier.other |
CMPE 2022 E75 |
|
dc.identifier.uri |
http://digitalarchive.boun.edu.tr/handle/123456789/19718 |
|
dc.description.abstract |
Transformer language models have paved the way for outstanding achievements on a wide variety of natural language processing tasks. The first step in transformer models is dividing the input into tokens. Over the years, various tokenization approaches have emerged. These approaches have further evolved from character- and word-level representations to subword-level representations. However, the impact of tokenization on model performance has not been thoroughly discussed, especially for morphologically rich languages. In this thesis, we comprehensively analyze subword tokenizers for Turkish, which is a highly inflected and morphologically rich language. We define various metrics to evaluate how well tokenizers encode Turkish morphology. We also examine how tokenizer parameters, such as vocabulary and corpus size, change the characteristics of tokenizers. Additionally, we propose a new tokenizer for agglutinative and morphologically rich languages. We demonstrate that our tokenizer reduces overall perplexity and enables better generalization performance. Downstream task experiments show that morphology supervision in tokenization improves model performance. |
|
dc.publisher |
Thesis (M.S.) - Bogazici University. Institute for Graduate Studies in Science and Engineering, 2022. |
|
dc.subject.lcsh |
Natural language processing (Computer science) |
|
dc.subject.lcsh |
Transformer language models. |
|
dc.title |
A comprehensive analysis of subword tokenizers for morphologically rich languages |
|
dc.format.pages |
xv, 66 leaves |
|