Abstract:
In this thesis, we propose spam e-mail filtering methods that achieve high accuracy with low time complexity. The methods are based on the n-gram approach and a heuristic referred to as the first n-words heuristic. Although the main concern of the research is the applicability of these methods to Turkish e-mails, they were also applied to English e-mails. A data set for both languages was compiled, and tests were performed with different parameters. Success rates above 95% for Turkish e-mails and around 98% for English e-mails were obtained. In addition, it is shown that time complexity can be reduced significantly without sacrificing success. We also propose a combined perception refinement (CPR), which improves the baseline success rates by around 2%; a development set is used in the first step of the CPR to determine the parameters used in the second step. Free word order is another characteristic of the Turkish language, and we will attempt to account for the free word order aspect of Turkish.
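To make the two ingredients named above concrete, the following is a minimal sketch and not the implementation evaluated in the thesis. It assumes that the first n-words heuristic means keeping only the first n words of a message, and it uses word bigrams with add-one smoothing and equal class priors as a stand-in n-gram model. The names `first_n_words`, `bigrams`, `BigramSpamFilter` and the default `n_words=50` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def first_n_words(text, n):
    """Keep only the first n whitespace-separated tokens, lower-cased.

    Assumption: this is how the first n-words heuristic truncates a message.
    """
    return text.lower().split()[:n]


def bigrams(tokens):
    """Return the list of adjacent word pairs in a token list."""
    return list(zip(tokens, tokens[1:]))


class BigramSpamFilter:
    """Illustrative two-class filter over bigrams of the truncated message."""

    def __init__(self, n_words=50):
        self.n_words = n_words  # hypothetical cut-off, not a value from the thesis
        # "normal" is listed first so that ties are resolved in its favour.
        self.counts = {"normal": Counter(), "spam": Counter()}
        self.vocab = set()

    def train(self, text, label):
        """Add the bigrams of one labelled message to the class counts."""
        grams = bigrams(first_n_words(text, self.n_words))
        self.counts[label].update(grams)
        self.vocab.update(grams)

    def score(self, text, label):
        """Log-likelihood of the truncated message under one class."""
        counts = self.counts[label]
        total = sum(counts.values())
        size = len(self.vocab) + 1  # +1 reserves mass for unseen bigrams
        return sum(
            math.log((counts[g] + 1) / (total + size))
            for g in bigrams(first_n_words(text, self.n_words))
        )

    def classify(self, text):
        """Return the class with the higher score (equal priors assumed)."""
        return max(self.counts, key=lambda label: self.score(text, label))
```

Because both training and classification look at no more than `n_words` tokens, the work per message is bounded by the cut-off and not by the message length, which is one way a reduction in time complexity of the kind described above could arise.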