Automatic topic categorization of Turkish Faxed Bank documents in the presence of OCR errors

Öztürk, Seçil.

dc.contributor	Graduate Program in Electrical and Electronic Engineering.
dc.contributor.advisor	Saraçlar, Murat.
dc.contributor.advisor	Sankur, Bülent.
dc.contributor.author	Öztürk, Seçil.
dc.date.accessioned	2023-03-16T10:18:35Z
dc.date.available	2023-03-16T10:18:35Z
dc.date.issued	2014.
dc.identifier.other	EE 2014 O87
dc.identifier.uri	http://digitalarchive.boun.edu.tr/handle/123456789/12870
dc.description.abstract	Technological advances in recent decades have made it easy to transfer and store huge amounts of scanned documents. This brings the challenge of automatically classifying large, unbalanced, multi-class, noisy, and relatively short text data, which is the scope of this thesis. The study addresses a real-world problem: classifying bank order documents of Yapı Kredi Bank. A corpus of academic paper abstracts, which resembles the original problem in class complexity and document length, is also collected and used. Combinations of methods for data balancing, pre-processing, feature extraction, feature selection, and classification are discussed. The unbalanced data are balanced by sampling documents either randomly or according to their noise and information content. To handle Optical Character Recognition (OCR) errors, each word is first assessed as corrigible or incorrigible according to its potential to be corrected. Corrigible words are corrected using four methods: a domain-specific glossary-based model, a language-model-based Hidden Markov Model, and normal and aggressive sequential correction models. To minimize redundant data, Named Entity tagging, Morfessor, and F5 stemming are used. Latent Dirichlet Allocation and Term Frequency-Inverse Document Frequency features are used. For balanced classes, the best technique is Term Frequency-Inverse Document Frequency features with Support Vector Machines, which was tested on both the Yapı Kredi Bank Orders and Academic Paper Abstracts datasets and reached up to 92% performance for 12 classes on the Yapı Kredi Bank Orders dataset.
dc.format.extent	30 cm.
dc.publisher	Thesis (M.S.) - Bogazici University. Institute for Graduate Studies in Science and Engineering, 2014.
dc.subject.lcsh	Adaptive control systems.
dc.subject.lcsh	Automatic control.
dc.subject.lcsh	Control theory.
dc.title	Automatic topic categorization of Turkish Faxed Bank documents in the presence of OCR errors
dc.format.pages	xv, 98 leaves ;
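
The abstract names F5 stemming and Term Frequency-Inverse Document Frequency features as parts of the pipeline. The following is a minimal sketch of that feature step only, not the thesis's implementation. It assumes that "F5" means truncating each token to its first five characters, and it uses one common TF-IDF variant (relative term frequency times log(N/df)); the function names `f5_stem`, `tokenize`, and `tfidf` and the sample sentences are hypothetical. OCR correction, sampling, and the Support Vector Machine classifier are left out.

```python
import math
from collections import Counter


def f5_stem(token, n=5):
    """Keep only the first n characters of a token.

    Assumption: 'F5 stemming' is read here as fixed-prefix truncation to 5 characters.
    """
    return token[:n]


def tokenize(text):
    """Lowercase, split on whitespace, and truncate each token.

    Note: str.lower() is not Turkish-aware (it maps 'I' to 'i', not to dotless 'ı').
    """
    return [f5_stem(t) for t in text.lower().split()]


def tfidf(docs):
    """Return one sparse {term: weight} dict per document.

    Weight = (count / document length) * log(N / document frequency).
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term at least once.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append(
            {t: (c / len(toks)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return vectors


if __name__ == "__main__":
    # Hypothetical sample documents, for illustration only.
    sample = [
        "transfer money account",
        "transfer funds abroad",
        "payment order account",
    ]
    for vec in tfidf(sample):
        print(vec)
```

Terms that occur in every document receive a weight of zero under this variant, so they contribute nothing to a downstream classifier. The resulting sparse vectors could then be passed to any linear classifier.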