Abstract:
Natural language processing (NLP) is a fascinating area of artificial intelligence that allows humans to interact with machines through natural language. NLP model architectures rest on two main concepts: input vectorization and contextual representation. Input vectorization starts with tokenization, for which there are three approaches: character-level, word-level, and subword-level. Word-level tokenization results in a large vocabulary, and in agglutinative languages such as Turkish, words derived from the same stem are treated as different words, which makes it difficult for NLP models to capture their relationships and the meaning of morphological affixes. Furthermore, all NLP models suffer from a common problem: spelling errors in the data. When a word is misspelled, its tokens become completely different, and the models cannot understand them. In this thesis, a character-level seq2seq transformer model is developed for spelling error correction. To train the model, a dataset for Turkish spelling correction is created by collecting correctly spelled Turkish sentences and systematically adding spelling errors to them. Standard seq2seq models require multiple decoding iterations and therefore have high prediction times. To address this problem, a novel architecture, the one-step seq2seq transformer model, is proposed, in which the transformer predicts the outputs in a single iteration. The proposed models are evaluated with the exact match criterion. The standard seq2seq model and the one-step seq2seq model achieve 68.64% and 42.69% accuracy, respectively. Finally, the standard seq2seq model makes predictions for 160 input characters in 8.47 seconds, while the one-step seq2seq model handles the same number of characters in 73 milliseconds on CPU and 28 milliseconds on GPU.