Archives and Documentation Center
Digital Archives

Parallel point classification into geographical regions

dc.contributor Graduate Program in Computational Science and Engineering.
dc.contributor.advisor Özturan, Can.
dc.contributor.author Tarmur, Sanver.
dc.date.accessioned 2023-03-16T10:03:43Z
dc.date.available 2023-03-16T10:03:43Z
dc.date.issued 2018.
dc.identifier.other CSE 2018 T37
dc.identifier.uri http://digitalarchive.boun.edu.tr/handle/123456789/12368
dc.description.abstract The amount of data generated by social media, social networks, and distributed platforms such as blockchain has reached very high levels. There are various use cases for processing this huge amount of data. One is to classify the geo-tagged data produced by social networks into geographical regions. We propose an efficient parallel classification approach and implement a classifier tool capable of processing huge geographical point data sets in parallel. Twitter data from five major cities of Turkey is used as the classification test set, considering the density of the regions. Important factors affecting classification performance include spatial indexing and the parallelization strategy. Hierarchical Triangular Mesh (HTM) and R-Tree spatial indexes are used for indexing regions, and the open-source Apache Spark and Kafka platforms are used to implement our classification application in a distributed and scalable environment. These platforms are designed to handle huge data streams and to respond quickly to varying volumes of data traffic. Benchmarks are provided in the thesis to show the effectiveness of our approach against the built-in spatial index of Microsoft SQL Server and the approach of Kondor et al. [1], in which HTM is applied on SQL Server. Our method has a significant advantage since it is built upon the Apache Spark platform, which is designed for processing chunks of a data stream in real time, whereas the other approaches are based on SQL Server, which cannot efficiently process massive streaming data. Speed-ups of 1.6 to 4.5 fold have been obtained in classification performance. The speed-up factor may change according to the query set size. Since our system has a scalable architecture, it is possible to expand the query set to billions of records. Apart from improved performance, our method is cost-effective, since Twitter data collected over a month can be processed in the cloud in around 3 hours at a small cost.
dc.format.extent 30 cm.
dc.publisher Thesis (M.S.) - Bogazici University. Institute for Graduate Studies in Science and Engineering, 2018.
dc.subject.lcsh Parallel processing (Electronic computers)
dc.subject.lcsh Geographical perception.
dc.title Parallel point classification into geographical regions
dc.format.pages x, 63 leaves ;

