A deep learning-based extractive text summarization system for Turkish news articles

Gündeş, Özcan.

dc.contributor	Graduate Program in Management Information Systems.
dc.contributor.advisor	Durahim, Ahmet Onur.
dc.contributor.author	Gündeş, Özcan.
dc.date.accessioned	2023-03-16T12:51:33Z
dc.date.available	2023-03-16T12:51:33Z
dc.date.issued	2020.
dc.identifier.other	MIS 2020 G86
dc.identifier.uri	http://digitalarchive.boun.edu.tr/handle/123456789/18106
dc.description.abstract	The goal of this study is to develop an automated extractive summarization system for Turkish news using pre-trained language models. Pre-trained language models have been applied to a wide range of natural language processing tasks and have achieved state-of-the-art results. In this thesis, pre-trained language models for Turkish are applied to the extractive summarization task. The proposed model stacks Transformer layers on top of a pre-trained language model to capture document-level features and semantic relationships between the sentences in a news article. Each sentence is then scored with a sigmoid function, which outputs a real value between 0 and 1. To train this model, 2076 news articles were collected from a well-known Turkish news website. After data collection, each sentence in the articles was labelled as 0 or 1 by a heuristic algorithm, and an extractive model was trained on these labels. At test time, the five highest-scoring sentences (Top-5) are combined to generate the final summary. In addition, to investigate the effects of hyperparameters, 241 models with different architectures and hyperparameter sets were run. The best model achieved a ROUGE-1 F score of 38.38, a ROUGE-2 F score of 26.8 and a ROUGE-L F score of 38.04. These scores are promising since they are significantly greater than those of the LEAD-5 baseline, which has ROUGE F scores of 37.49, 26.4 and 37.12. For this study, LEAD-5 is a very strong baseline, since the most significant sentences are placed at the beginning of a news article to capture the readers' attention. Therefore, the proposed model performs well on the Turkish news dataset.
dc.format.extent	30 cm.
dc.publisher	Thesis (M.A.) - Bogazici University. Institute for Graduate Studies in the Social Sciences, 2020.
dc.subject.lcsh	Natural language processing (Computer science)
dc.title	A deep learning-based extractive text summarization system for Turkish news articles
dc.format.pages	xi, 102 leaves ;
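
The selection step and the LEAD-5 baseline described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the thesis: the function names, the use of raw per-sentence scores (logits) as input, and the choice to return the selected sentences in their original document order are all assumptions made for illustration.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Map a raw score to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def top_k_summary(sentences: List[str], logits: List[float], k: int = 5) -> List[str]:
    """Pick the k highest-scoring sentences (hypothetical helper).

    Assumption: the chosen sentences are returned in document order;
    the abstract does not say how they are ordered.
    """
    scores = [sigmoid(z) for z in logits]
    # Indices of the k sentences with the highest sigmoid scores.
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(best)]


def lead_k_summary(sentences: List[str], k: int = 5) -> List[str]:
    """LEAD-k baseline: take the first k sentences of the article."""
    return sentences[:k]
```

With k = 5, `top_k_summary` corresponds to the Top-5 selection and `lead_k_summary` to the LEAD-5 baseline; both outputs would then be compared against reference summaries with ROUGE.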