Improving text classification performance with the analysis of lexical dependencies and class-based feature selection

Özgür, Levent.

dc.contributor	Ph.D. Program in Computer Engineering.
dc.contributor.advisor	Güngör, Tunga.
dc.contributor.author	Özgür, Levent.
dc.date.accessioned	2023-03-16T10:13:32Z
dc.date.available	2023-03-16T10:13:32Z
dc.date.issued	2010.
dc.identifier.other	CMPE 2010 O84 PhD
dc.identifier.uri	http://digitalarchive.boun.edu.tr/handle/123456789/12558
dc.description.abstract	In this thesis, we present a comprehensive analysis of feature extraction and feature selection techniques for the text classification problem, with the aim of achieving more successful results using much smaller feature vector sizes. For feature extraction, 36 different lexical dependencies are included and analyzed independently in the feature vector as an extension to the standard bag-of-words approach. The feature selection analysis is twofold. In the first stage, the pruning implementation is analyzed and optimal pruning levels are extracted with respect to dataset properties and feature variations (words, dependencies, combination of the leading dependencies). In the second stage, we compare the performance of corpus-based and class-based approaches for feature selection coverage and then extend the pruning implementation with optimized class-based feature selection. For the final and most advanced test, we serialize the optimal use of the leading dependencies for each experimented dataset with the two-stage (corpus-based and class-based) feature selection approach. For performance evaluation, we use the state-of-the-art measures for text classification problems: two different success score metrics and three different significance tests. With respect to these measures, the results reveal that each extension of the methods yields a corresponding significant improvement. The most advanced method, which combines the leading dependencies with optimal pruning levels and an optimal number of class-based features, mostly outperforms the other methods in terms of success rates with reasonable feature sizes. To the best of our knowledge, this is the first study that makes such a detailed analysis of extracting individual dependencies and employing feature selection with a two-stage selection approach in text classification and, more generally, in the text domain.
dc.format.extent	30cm.
dc.publisher	Thesis (Ph.D.)-Bogazici University. Institute for Graduate Studies in Science and Engineering, 2010.
dc.subject.lcsh	Artificial intelligence.
dc.subject.lcsh	Text processing (Computer science)
dc.title	Improving text classification performance with the analysis of lexical dependencies and class-based feature selection
dc.format.pages	xx, 97 leaves;
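
The two-stage feature selection described in the abstract (pruning, then corpus-based versus class-based selection) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the thesis: the function names, the toy documents, the pruning threshold, and the use of plain document frequency as the ranking score. The thesis may well use a different scoring metric and different pruning criteria.

```python
from collections import Counter, defaultdict

def prune(docs, min_count=2):
    """Stage 1 (assumed form): drop terms whose total corpus frequency is below min_count."""
    totals = Counter(tok for tokens, _ in docs for tok in tokens)
    return [([t for t in tokens if totals[t] >= min_count], label)
            for tokens, label in docs]

def corpus_based(docs, k):
    """Select the top-k terms from one corpus-wide ranking (score: document frequency)."""
    df = Counter(tok for tokens, _ in docs for tok in set(tokens))
    ranked = sorted(df.items(), key=lambda item: (-item[1], item[0]))
    return {term for term, _ in ranked[:k]}

def class_based(docs, k):
    """Select the top-k terms separately for each class, then take the union."""
    per_class = defaultdict(Counter)
    for tokens, label in docs:
        per_class[label].update(set(tokens))
    selected = set()
    for df in per_class.values():
        ranked = sorted(df.items(), key=lambda item: (-item[1], item[0]))
        selected.update(term for term, _ in ranked[:k])
    return selected

# Hypothetical toy corpus: (tokens, class label) pairs.
docs = [
    (["ball", "goal", "team", "the"], "sport"),
    (["team", "goal", "the", "win"], "sport"),
    (["vote", "law", "the", "party"], "politics"),
    (["vote", "party", "the", "rare"], "politics"),
]

pruned = prune(docs, min_count=2)
corpus_features = corpus_based(pruned, k=2)
class_features = class_based(pruned, k=2)
print(sorted(corpus_features))
print(sorted(class_features))
```

In this toy setting the class-based variant guarantees that every class contributes its own top-ranked terms, whereas a single corpus-wide ranking can leave a class without any selected feature. A raw frequency score also keeps uninformative words such as "the", which is one reason a real system would likely use a more discriminative score.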