Abstract:
Categorical variables are present in most real-world datasets and often have a large number of levels, in which case they are referred to as high-cardinality categorical variables. Most machine learning algorithms have no innate mechanism for handling categorical variables, so these variables must be encoded. Categorical variable encoding is the general term for the conversion of nominal independent variables to a numerical format. Many encoding strategies exist, and they are discussed in this thesis. This thesis presents a novel encoding strategy, categorical split encoding, and provides an analysis of existing encoding methods. Categorical split encoding uses primary and surrogate split information as the vector representation for categorical variables. Through a tree-based algorithm, the method outputs binary columns for each categorical variable, making use of target information. Missing values are imputed using surrogate information, and similar values are clustered together based on the path they take through the decision tree. Various existing encoding strategies are benchmarked for comparison with the proposed strategy. The performance of categorical split encoding and the other encoding methods is compared using three machine learning algorithms (generalized linear models, random forest, and XGBoost) on datasets from regression, binary classification, and multiclass classification settings. The datasets used are made publicly available for replication purposes. Overall, categorical split encoding provides competitive results compared to existing encoding strategies on various datasets.
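To make the idea of a target-aware binary column concrete, the following is a minimal sketch of a single CART-style split on one categorical variable in a regression setting. It is not the categorical split encoding algorithm itself: it uses only one primary split, no surrogate splits, and no missing-value handling. The function names `best_binary_split` and `encode`, and the toy data, are hypothetical and chosen for illustration only.

```python
from collections import defaultdict


def best_binary_split(levels, targets):
    """Return the set of levels sent to the left child by the best single split.

    Assumption: squared-error (regression) criterion, one split only.
    """
    # Group the target values by categorical level.
    groups = defaultdict(list)
    for level, y in zip(levels, targets):
        groups[level].append(y)

    # Order levels by mean target. Under squared error, the best binary
    # partition of the levels is a cut point in this ordering.
    ordered = sorted(groups, key=lambda lv: sum(groups[lv]) / len(groups[lv]))

    def sse(values):
        # Sum of squared errors around the mean of the values.
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    best_cut, best_cost = None, float("inf")
    for cut in range(1, len(ordered)):
        left = [y for lv in ordered[:cut] for y in groups[lv]]
        right = [y for lv in ordered[cut:] for y in groups[lv]]
        cost = sse(left) + sse(right)
        if cost < best_cost:
            best_cut, best_cost = cut, cost

    # With a single level there is no valid cut; every level goes left.
    return set(ordered[:best_cut])


def encode(levels, left_levels):
    """Binary column: 1 if the level falls in the left child, else 0."""
    return [1 if level in left_levels else 0 for level in levels]


# Toy data (hypothetical): a colour variable and a numeric target.
colors = ["red", "blue", "green", "red", "blue", "green", "teal", "teal"]
prices = [10.0, 30.0, 31.0, 12.0, 29.0, 33.0, 11.0, 9.0]

left = best_binary_split(colors, prices)
column = encode(colors, left)
```

In this toy example the levels with low mean targets (`red`, `teal`) end up on one side of the split and the remaining levels on the other, so a single binary column already groups levels that behave similarly with respect to the target. A deeper tree would yield one such column per split along the path.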