Abstract:
Binary classification of biological data is an important research problem in both the Bioinformatics and Machine Learning fields. This problem is particularly challenging when very few labeled instances are available. There are three main machine learning approaches for classification: supervised methods, which use only labeled data; unsupervised methods, which use only unlabeled data; and semi-supervised methods, which use both labeled and unlabeled data. In this study, we compare supervised methods with several semi-supervised methods that we developed, based on k-NN (k-Nearest Neighbor), SVM (Support Vector Machine) with a linear kernel, and SVM with an RBF (Radial Basis Function) kernel, on two different Bioinformatics problems: predicting recurrence in colorectal cancer from microarray data and predicting HIV-1-human protein-protein interactions. In contrast to traditional semi-supervised learning approaches, we introduce the notion of `softly labeled' data: unlabeled data accompanied by additional information about their likely labels. To better understand the behavior of our algorithms, we also evaluate them on a well-known optical digit dataset, classifying the digits `5' and `6', where we generate synthetic noise to produce softly labeled data. For all datasets, we conclude that softly labeled data are informative and improve the evaluation results. Our semi-supervised methods SS-kNN (Semi-supervised kNN) and SS-SVM (Semi-supervised SVM) perform better than the other algorithms in terms of accuracy on the colorectal cancer and optical digit data, and in terms of area under the precision-recall curve on the HIV-1-human protein-protein interaction data. Furthermore, in general, our semi-supervised methods outperform the supervised ones.
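To make the idea of softly labeled data concrete, the following is a minimal illustrative sketch and not the SS-kNN algorithm evaluated in this study. It assumes one simple way such data could enter a k-NN classifier: softly labeled points vote alongside labeled points, but with a reduced weight. The function name `weighted_knn_predict`, the parameter `soft_weight`, and the toy data are all hypothetical.

```python
import math
from collections import defaultdict


def weighted_knn_predict(query, labeled, softly_labeled, k=3, soft_weight=0.5):
    """Predict a label by a weighted vote among the k nearest points.

    labeled and softly_labeled are lists of (feature_vector, label) pairs.
    soft_weight is a hypothetical confidence in (0, 1] given to soft labels;
    fully labeled points always vote with weight 1.0.
    """
    # Pool both sets, tagging every point with its vote weight.
    pool = [(x, y, 1.0) for x, y in labeled]
    pool += [(x, y, soft_weight) for x, y in softly_labeled]

    # Keep the k points closest to the query (Euclidean distance).
    nearest = sorted(pool, key=lambda p: math.dist(query, p[0]))[:k]

    # Accumulate weighted votes per label and return the heaviest label.
    votes = defaultdict(float)
    for _, label, weight in nearest:
        votes[label] += weight
    return max(votes, key=votes.get)


# Toy data (invented for illustration): two labeled points, three soft ones.
labeled = [((0.0, 0.0), 0), ((1.0, 1.0), 1)]
softly_labeled = [((0.9, 1.1), 1), ((0.1, -0.1), 0), ((1.2, 0.8), 1)]

prediction = weighted_knn_predict((0.8, 0.9), labeled, softly_labeled, k=3)
```

In this toy setting the three nearest neighbors of the query all carry label 1, so the softly labeled points reinforce the single labeled neighbor. Choosing `soft_weight` below 1.0 is a design assumption that keeps an expected label from counting as much as a confirmed one.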