Abstract:
Statistical tests in the literature mainly use error rate for comparison and assume equal loss for false positives and false negatives. Receiver Operating Characteristic (ROC) curves and the Area Under the ROC Curve (AUC) can also be used for comparing classifier performance under a spectrum of loss values. An ROC curve, and hence an AUC value, is typically calculated from a single training/test pair; to average over the randomness of the split, we propose to use k-fold cross-validation to generate a set of ROC curves and AUC values, to which we can fit a distribution and on which we can test hypotheses. Experimental results on 15 datasets using five different classification algorithms show that our proposed test on AUC values is to be preferred over the usual paired t test on error rate, because it can detect equivalences and differences that the error-rate test cannot. The approach we use for ROC curves can also be applied to precision-recall curves, used mostly in information retrieval, by applying the k-fold cross-validated test to the area under the precision-recall curve. When multiple classifiers are compared on one or more datasets, we can use analysis of variance (ANOVA); when more than one performance metric is used, we use its multivariate version, MANOVA. As the performance metric, ANOVA uses error rate or AUC, whereas MANOVA uses the true positive, false positive, true negative, and false negative rates together. We also perform the nonparametric counterpart of ANOVA, the Friedman test, and apply the sign test when comparing multiple classifiers over multiple datasets. We observe that using more than one performance metric takes their correlation into account in the statistical test and therefore produces more accurate results.
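The following is a minimal sketch of the proposed idea, not the authors' implementation: it generates per-fold AUC values for two classifiers via k-fold cross-validation and runs a paired t test on the resulting AUC samples. The dataset, the two classifiers, and k = 10 are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_auc_score
from scipy.stats import ttest_rel

# Illustrative dataset and classifiers (assumptions, not those of the paper)
X, y = load_breast_cancer(return_X_y=True)
clf_a, clf_b = DecisionTreeClassifier(random_state=0), GaussianNB()

# Collect one AUC value per fold for each classifier
auc_a, auc_b = [], []
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(X, y):
    for clf, scores in ((clf_a, auc_a), (clf_b, auc_b)):
        clf.fit(X[train_idx], y[train_idx])
        # AUC on this fold's test split, using class-1 probability scores
        scores.append(roc_auc_score(y[test_idx],
                                    clf.predict_proba(X[test_idx])[:, 1]))

# Paired t test on the k AUC values; H0: the two classifiers have equal AUC
t_stat, p_value = ttest_rel(auc_a, auc_b)
print(f"mean AUC(A)={sum(auc_a)/len(auc_a):.3f}, "
      f"mean AUC(B)={sum(auc_b)/len(auc_b):.3f}, p={p_value:.3f}")
```

The same pattern extends to the other comparisons mentioned above, for example by substituting average precision for AUC, or by feeding the per-fold metrics of several classifiers into an ANOVA or Friedman test.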