
A seq2seq transformer model for Turkish spelling correction


dc.contributor Graduate Program in Computer Engineering.
dc.contributor.advisor Özgür, Arzucan.
dc.contributor.author Batmaz, Şahin.
dc.date.accessioned 2023-10-15T06:58:18Z
dc.date.available 2023-10-15T06:58:18Z
dc.date.issued 2022
dc.identifier.other CMPE 2022 B37
dc.identifier.uri http://digitalarchive.boun.edu.tr/handle/123456789/19717
dc.description.abstract Natural language processing (NLP) is a fascinating area of artificial intelligence. It allows humans to interact with machines through natural language. There are two main concepts in NLP model architectures, namely input vectorization and contextual representation. The input vectorization process starts with tokenization, for which there are three approaches: character-level, word-level, and subword-level. Word-level tokenization results in a large vocabulary, and in agglutinative languages such as Turkish, words derived from the same stem are treated as different words. This makes it difficult for NLP models to understand their relationships and the meaning of the morphological affixes. Furthermore, all NLP models suffer from a common problem: spelling errors in the data. When spelling errors occur, the misspelled tokens become completely different tokens, and the models cannot understand them. In this thesis, a character-level seq2seq transformer model is developed for spelling error correction. To train the model, a dataset for Turkish spelling correction is created by collecting correctly spelled Turkish sentences and systematically adding spelling errors to them. Seq2seq models require multiple decoding iterations and therefore have high prediction times. To address this problem, a novel model architecture, the one-step seq2seq transformer model, is proposed, in which the transformer predicts all outputs in a single iteration. The proposed models are evaluated with the exact match criterion. The standard seq2seq model and the one-step seq2seq model achieved 68.64% and 42.69% accuracy, respectively. Finally, the standard seq2seq model makes predictions for 160 input characters in 8.47 seconds, while the one-step seq2seq model makes predictions for the same number of characters in 73 milliseconds on CPU and 28 milliseconds on GPU.
dc.publisher Thesis (M.S.) - Bogazici University. Institute for Graduate Studies in Science and Engineering, 2022.
dc.subject.lcsh Natural language processing (Computer science)
dc.title A seq2seq transformer model for Turkish spelling correction
dc.format.pages xi, 57 leaves
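
The abstract states that the training dataset is built by collecting correctly spelled Turkish sentences and systematically adding spelling errors to them. The record gives no implementation details, so the following is only a minimal Python sketch of what such error injection could look like; the error types, the function name add_spelling_errors, and the Turkish diacritic confusion pairs are assumptions for illustration, not details taken from the thesis.

import random

# Plausible character-level error operations; the actual error types used
# in the thesis are not given in the abstract, so these are assumptions.
DIACRITIC_SWAPS = {"ı": "i", "i": "ı", "ş": "s", "ç": "c",
                   "ğ": "g", "ö": "o", "ü": "u"}
ALPHABET = "abcçdefgğhıijklmnoöprsştuüvyz"

def add_spelling_errors(sentence, n_errors=1):
    """Corrupt a correctly spelled sentence with random character errors."""
    chars = list(sentence)
    for _ in range(n_errors):
        i = random.randrange(len(chars))
        op = random.choice(["delete", "insert", "substitute",
                            "transpose", "diacritic"])
        if op == "delete" and len(chars) > 1:
            del chars[i]
        elif op == "insert":
            chars.insert(i, random.choice(ALPHABET))
        elif op == "substitute":
            chars[i] = random.choice(ALPHABET)
        elif op == "transpose" and i < len(chars) - 1:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
        elif op == "diacritic" and chars[i] in DIACRITIC_SWAPS:
            chars[i] = DIACRITIC_SWAPS[chars[i]]
    return "".join(chars)

# Each (noisy, clean) sentence pair is one training example for the
# character-level seq2seq model.
clean = "yarın okula gideceğim"
noisy = add_spelling_errors(clean, n_errors=2)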
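
The abstract also contrasts standard autoregressive seq2seq decoding (one forward pass per output character, hence 8.47 seconds for 160 characters) with the proposed one-step model that predicts all outputs in a single iteration. The sketch below illustrates that contrast with an untrained stand-in model; the vocabulary size, model dimensions, token conventions, and in particular the fixed-length placeholder mechanism for the one-step decoder are assumptions about how such a model might be wired, not the thesis's actual architecture.

import torch
import torch.nn as nn

# Untrained stand-in for the character-level transformer described in the
# abstract. Vocabulary size, model width, and token ids are assumptions.
VOCAB, D_MODEL, MAX_LEN = 64, 128, 160
emb = nn.Embedding(VOCAB, D_MODEL)
transformer = nn.Transformer(d_model=D_MODEL, batch_first=True)
to_logits = nn.Linear(D_MODEL, VOCAB)
src = torch.randint(0, VOCAB, (1, MAX_LEN))  # one noisy 160-character input

with torch.no_grad():
    # Standard seq2seq: autoregressive decoding. Each output character
    # requires a full forward pass, so cost grows with output length.
    tgt = torch.zeros(1, 1, dtype=torch.long)  # assumed <bos> id 0
    for _ in range(MAX_LEN):
        out = transformer(emb(src), emb(tgt))
        next_id = to_logits(out[:, -1]).argmax(-1, keepdim=True)
        tgt = torch.cat([tgt, next_id], dim=1)

    # One-step variant (an assumption about the mechanism): the decoder is
    # fed a fixed-length placeholder sequence and every output character is
    # predicted in a single forward pass, which is why it is far faster.
    placeholder = torch.zeros(1, MAX_LEN, dtype=torch.long)
    one_step = to_logits(transformer(emb(src), emb(placeholder))).argmax(-1)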

