Archives and Documentation Center
Digital Archives

Improving text classification performance with the analysis of lexical dependencies and class-based feature selection

dc.contributor Ph.D. Program in Computer Engineering.
dc.contributor.advisor Güngör, Tunga.
dc.contributor.author Özgür, Levent.
dc.date.accessioned 2023-03-16T10:13:32Z
dc.date.available 2023-03-16T10:13:32Z
dc.date.issued 2010.
dc.identifier.other CMPE 2010 O84 PhD
dc.identifier.uri http://digitalarchive.boun.edu.tr/handle/123456789/12558
dc.description.abstract In this thesis, we present a comprehensive analysis of feature extraction and feature selection techniques for the text classification problem, with the aim of achieving more successful results using much smaller feature vector sizes. For feature extraction, 36 different lexical dependencies are included and analyzed independently in the feature vector as an extension to the standard bag-of-words approach. The feature selection analysis is twofold. In the first stage, the pruning implementation is analyzed and optimal pruning levels are extracted with respect to dataset properties and feature variations (words, dependencies, and combinations of the leading dependencies). In the second stage, we compare the performance of corpus-based and class-based approaches for feature selection coverage and then extend the pruning implementation with the optimized class-based feature selection. For the final and most advanced test, we serialize the optimal use of the leading dependencies for each experimented dataset with the two-stage (corpus-based and class-based) feature selection approach. For performance evaluation, we use the state-of-the-art measures for text classification problems: two different success score metrics and three different significance tests. With respect to these measures, the results reveal that each extension of the methods yields a corresponding significant improvement. The most advanced method, which combines the leading dependencies with optimal pruning levels and an optimal number of class-based features, mostly outperforms the other methods in terms of success rates with reasonable feature sizes. To the best of our knowledge, this is the first study to make such a detailed analysis of extracting individual dependencies and employing feature selection with a two-stage selection approach in text classification and, more generally, in the text domain.
dc.format.extent 30cm.
dc.publisher Thesis (Ph.D.)-Bogazici University. Institute for Graduate Studies in Science and Engineering, 2010.
dc.subject.lcsh Artificial intelligence.
dc.subject.lcsh Text processing (Computer science)
dc.title Improving text classification performance with the analysis of lexical dependencies and class-based feature selection
dc.format.pages xx, 97 leaves;

