Archives and Documentation Center
Digital Archives

Automatic topic categorization of Turkish Faxed Bank documents in the presence of OCR errors

dc.contributor Graduate Program in Electrical and Electronic Engineering.
dc.contributor.advisor Saraçlar, Murat.
dc.contributor.advisor Sankur, Bülent.
dc.contributor.author Öztürk, Seçil.
dc.date.accessioned 2023-03-16T10:18:35Z
dc.date.available 2023-03-16T10:18:35Z
dc.date.issued 2014.
dc.identifier.other EE 2014 O87
dc.identifier.uri http://digitalarchive.boun.edu.tr/handle/123456789/12870
dc.description.abstract Technological advances in recent decades have made it easy to transfer and store huge amounts of scanned documents. This brings the challenge of automatically classifying large, unbalanced, multi-class, noisy and relatively short text data, which is the scope of this thesis. The study addresses a real-world problem: classifying bank order documents of Yapı Kredi Bank. A corpus of academic paper abstracts, which resembles the original problem in terms of class complexity and document length, is also collected and used. Combinations of methods for balancing, pre-processing, feature extraction, feature selection and classification are discussed. The unbalanced data are balanced by sampling documents either randomly or according to their noise and information content. To handle Optical Character Recognition (OCR) errors, each word is first assessed as corrigible or incorrigible in terms of its potential to be corrected. Corrigible words are then corrected with one of four methods: a domain-specific glossary-based model, a language-model-based Hidden Markov Model, and normal or aggressive sequential correction models. To minimize redundant data, Named Entity tagging, Morfessor and F5 stemming are used. Latent Dirichlet Allocation and Term Frequency Inverse Document Frequency features are used. For classifying balanced classes, the best technique is Term Frequency Inverse Document Frequency features with Support Vector Machines, as confirmed by tests on both the Yapı Kredi Bank Orders and Academic Paper Abstracts datasets, with up to 92% performance for 12 classes on the Yapı Kredi Bank Orders dataset.
dc.format.extent 30 cm.
dc.publisher Thesis (M.S.) - Bogazici University. Institute for Graduate Studies in Science and Engineering, 2014.
dc.subject.lcsh Adaptive control systems.
dc.subject.lcsh Automatic control.
dc.subject.lcsh Control theory.
dc.title Automatic topic categorization of Turkish Faxed Bank documents in the presence of OCR errors
dc.format.pages xv, 98 leaves ;
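
Illustrative note: the abstract names Term Frequency Inverse Document Frequency (TF-IDF) features as the best-performing representation, but this record does not say which weighting variant the thesis uses. The sketch below shows one common textbook form only, with tf taken as relative term frequency and idf as log(N / df). The function name `tfidf` and the toy token lists are hypothetical and are not taken from the thesis or its data.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed weighting (not confirmed by the thesis record):
    tf = count / document length, idf = log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Toy, made-up token lists standing in for pre-processed documents.
docs = [
    ["havale", "talimat", "hesap"],
    ["eft", "talimat", "hesap"],
    ["cek", "odeme", "hesap"],
]
weights = tfidf(docs)
```

In this toy input, a term that occurs in every document gets weight zero, while a term unique to one document gets the largest weight in that document. Vectors of this kind could then be passed to a classifier such as a Support Vector Machine, which is the pairing the abstract reports as most effective.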