An evaluation of existing and new feature selection metrics in automatic text categorization

Taşcı, Şerafettin.

dc.contributor	Graduate Program in Computer Engineering.
dc.contributor.advisor	Güngör, Tunga.
dc.contributor.author	Taşcı, Şerafettin.
dc.date.accessioned	2023-03-16T09:59:43Z
dc.date.available	2023-03-16T09:59:43Z
dc.date.issued	2008.
dc.identifier.other	CMPE 2008 T37
dc.identifier.uri	http://digitalarchive.boun.edu.tr/handle/123456789/12099
dc.description.abstract	In recent years, the number of documents available in electronic form, such as electronic books, digital libraries and email messages, has increased rapidly. As a result, organizing and manipulating these resources has become both more important and more difficult. Automatic text categorization is widely used for this purpose. However, because the data in text categorization is very high-dimensional, feature selection is crucial to making the task more efficient and precise. In this study, we extensively evaluate the feature selection metrics used in text categorization under local and global policies. For the experiments, we use seven datasets that vary in size, complexity and skewness. We use SVM as the classifier and tf-idf for term weighting. We observed that, for almost all metrics and datasets, the local policy outperforms the others when the number of keywords is low, and the global policy outperforms the others as the number of keywords increases. In addition to evaluating the existing feature selection metrics, we propose new metrics that achieve high success rates, especially with a low number of keywords. Moreover, we propose a keyword selection framework called Adaptive Keyword Selection (AKS). It is based on selecting a different number of keywords for each class, and it improved performance significantly on skewed datasets.
dc.format.extent	30cm.
dc.publisher	Thesis (M.S.)-Bogazici University. Institute for Graduate Studies in Science and Engineering, 2008.
dc.relation	Includes appendices.
dc.subject.lcsh	Text processing (Computer science)
dc.subject.lcsh	Information storage and retrieval systems.
dc.subject.lcsh	Machine learning.
dc.title	An evaluation of existing and new feature selection metrics in automatic text categorization
dc.format.pages	xiii, 73 leaves;
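
The abstract contrasts local and global keyword selection policies without defining them in this record. The sketch below is one common reading, stated here as an assumption: a local policy picks a separate top-k keyword list for each class, while a global policy picks a single list by ranking each term on its best per-class score. Chi-square is used only as a stand-in metric; it is not one of the thesis's proposed metrics, and the function names and toy corpus are hypothetical.

```python
from collections import Counter, defaultdict


def chi_square_scores(docs):
    """docs: list of (label, set_of_terms). Returns {term: {label: chi2}}."""
    n = len(docs)
    class_count = Counter(label for label, _ in docs)
    term_count = Counter()
    joint = Counter()
    for label, terms in docs:
        for t in set(terms):
            term_count[t] += 1
            joint[(t, label)] += 1
    scores = defaultdict(dict)
    for t in term_count:
        for c in class_count:
            a = joint[(t, c)]            # docs in class c that contain t
            b = term_count[t] - a        # docs outside c that contain t
            cc = class_count[c] - a      # docs in c that lack t
            d = n - a - b - cc           # docs outside c that lack t
            denom = (a + cc) * (b + d) * (a + b) * (cc + d)
            scores[t][c] = n * (a * d - b * cc) ** 2 / denom if denom else 0.0
    return scores


def local_select(scores, k):
    """Assumed local policy: a separate top-k keyword list per class."""
    classes = sorted({c for per_class in scores.values() for c in per_class})
    return {
        c: sorted(scores, key=lambda t: (-scores[t][c], t))[:k]
        for c in classes
    }


def global_select(scores, k):
    """Assumed global policy: one top-k list, ranking terms by max class score."""
    return sorted(scores, key=lambda t: (-max(scores[t].values()), t))[:k]


# Hypothetical toy corpus: two classes, four documents.
docs = [
    ("sport", {"goal", "match", "team"}),
    ("sport", {"goal", "team", "win"}),
    ("tech", {"code", "bug", "team"}),
    ("tech", {"code", "compiler", "win"}),
]
scores = chi_square_scores(docs)
print("local :", local_select(scores, 2))
print("global:", global_select(scores, 2))
```

With only two classes, chi-square gives a term the same score for both classes, so the two policies agree here; they diverge once there are more classes or the class sizes are skewed, which is the setting the abstract's Adaptive Keyword Selection targets.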