Abstract:
The amount of data generated by social media, social networks and distributed platforms such as blockchain, have reached quite high levels. There are various use cases to process this huge amount of data. One is to classify the geo-tagged data which is produced by social networks into geographical regions. We propose an effi cient parallel classification approach and implement a classifier tool which is capable of processing huge geographical point data in parallel. Twitter data from five major cities of Turkey is used as classification test set considering the density of the regions. There are important factors effecting the classification performance such as spatial indexing and parallelization strategy. Hierarchical Triangular Mesh (HTM) and R-Tree spatial indexes are used for indexing regions and open-source Apache Spark and Kafka plat forms are used to implement our classification application in a distributed and scalable environment. The mentioned platforms are designed to handle huge data streams and quickly respond varying volume of data traffic. Benchmarks are provided in thesis to show effectiveness of our approach against built-in spatial index of Microsoft SQL Server and approach of Kondor et al. [1] in which HTM is applied on SQL Server. Our method has significant advantage since it is built upon Apache Spark platform which is crafted for processing chunks of data stream in real-time, however other approaches are based on SQL Server which cannot efficiently process massive streaming data. 1.6-4.5 fold speed-ups have been obtained in classification performance. The speed-up factor may change according to the query set size. Since our system has a scalable archi tecture it is possible to expand query set to billions of records. Apart from improved performance, our method is cost-effective since Twitter data collected over a month can be processed on cloud in around 3 hours with a small cost.