dc.description.abstract |
The technological advances of the last decades have made it easy to transfer and store huge amounts of scanned soft documents. This improvement brings the challenge of automatically classifying big, unbalanced, multi-class, noisy and relatively short text data, which is the scope of this thesis. This study addresses a real-world problem: classifying the bank order documents of Yapı Kredi Bank. A corpus of academic paper abstracts, which resembles the original problem in terms of class complexity and document length, is also collected and used. Combinations of methods for balancing, pre-processing, feature extraction, feature selection and classification are discussed in this study. The unbalanced data are balanced by sampling documents randomly or according to their noise and information content. To handle Optical Character Recognition errors, each word is first assessed as corrigible or incorrigible in terms of its potential to be corrected. For corrigible words, four correction methods are used: a domain-specific glossary-based model, a language-model-based Hidden Markov Model, and normal or aggressive sequential correction models. In order to minimize redundant data, Named Entity tagging, Morfessor and F5 stemming are used. Latent Dirichlet Allocation and Term Frequency-Inverse Document Frequency are used as features. To classify balanced classes, the best technique is Term Frequency-Inverse Document Frequency features with Support Vector Machines, which is tested and confirmed on both the Yapı Kredi Bank Orders and Academic Paper Abstracts datasets, reaching up to 92% performance for 12 classes on the Yapı Kredi Bank Orders dataset. |
|