Parallel point classification into geographical regions

Tarmur, Sanver.

Archives and Documentation Center Digital Archives Home
→
Boğaziçi Üniversitesi Tezleri
→
Fen Bilimleri Enstitüsü
→
Bilgisayar Mühendisliği
→
M.S. Theses
→
View Item

Parallel point classification into geographical regions

Tarmur, Sanver.

URI: http://digitalarchive.boun.edu.tr/handle/123456789/12368

Date: 2018.

Abstract:

The amount of data generated by social media, social networks and distributed platforms such as blockchain, have reached quite high levels. There are various use cases to process this huge amount of data. One is to classify the geo-tagged data which is produced by social networks into geographical regions. We propose an eﬃ cient parallel classiﬁcation approach and implement a classiﬁer tool which is capable of processing huge geographical point data in parallel. Twitter data from ﬁve major cities of Turkey is used as classiﬁcation test set considering the density of the regions. There are important factors eﬀecting the classiﬁcation performance such as spatial indexing and parallelization strategy. Hierarchical Triangular Mesh (HTM) and R-Tree spatial indexes are used for indexing regions and open-source Apache Spark and Kafka plat forms are used to implement our classiﬁcation application in a distributed and scalable environment. The mentioned platforms are designed to handle huge data streams and quickly respond varying volume of data traﬃc. Benchmarks are provided in thesis to show eﬀectiveness of our approach against built-in spatial index of Microsoft SQL Server and approach of Kondor et al. [1] in which HTM is applied on SQL Server. Our method has signiﬁcant advantage since it is built upon Apache Spark platform which is crafted for processing chunks of data stream in real-time, however other approaches are based on SQL Server which cannot eﬃciently process massive streaming data. 1.6-4.5 fold speed-ups have been obtained in classiﬁcation performance. The speed-up factor may change according to the query set size. Since our system has a scalable archi tecture it is possible to expand query set to billions of records. Apart from improved performance, our method is cost-eﬀective since Twitter data collected over a month can be processed on cloud in around 3 hours with a small cost.

Show full item record