Utilizing weakly-supervised learning for hashtag segmentation and named entity disambiguation

Çelebi, Arda.

dc.contributor	Ph.D. Program in Computer Engineering.
dc.contributor.advisor	Özgür, Arzucan.
dc.contributor.author	Çelebi, Arda.
dc.date.accessioned	2023-03-16T10:14:04Z
dc.date.available	2023-03-16T10:14:04Z
dc.date.issued	2020.
dc.identifier.other	CMPE 2020 C45 PhD
dc.identifier.uri	http://digitalarchive.boun.edu.tr/handle/123456789/12639
dc.description.abstract	Today’s high-performing machine learning algorithms learn to predict under the supervision of large amounts of human-labeled data. However, the labeling process is costly in terms of time and effort. In this thesis, we design weakly-supervised approaches, which are based on automatically labeling raw data, for two different Natural Language Processing (NLP) tasks, namely hashtag segmentation and Named Entity Disambiguation (NED). The aim of hashtag segmentation is to identify the words in hashtags, so as to process and understand them better. We propose a heuristic to obtain automatically segmented hashtags using a large tweet corpus and use these data to train a maximum entropy classifier. State-of-the-art accuracy is achieved for hashtag segmentation without using any manually labeled training data. The target of NED, which is the second task that we address, is to link the named entity (NE) mentions in text to their corresponding records in the Knowledge Base. We hypothesize that the types of the NE mentions may provide useful clues for their correct disambiguation. The standard approaches for identifying mention types require a type taxonomy and large amounts of mentions annotated with their types. We propose a cluster-based mention typing approach, which does not require a type taxonomy or labeled mentions. This weakly-supervised approach is based on clustering the NEs in Wikipedia by using different levels of contextual information and automatically generating data for training a mention typing model. The mention type predictions lead to significant F-score improvement when incorporated into a supervised NED model. This thesis shows that designing weakly-supervised approaches by considering the underlying characteristics of the addressed problem can be an effective strategy for NLP.
dc.format.extent	30 cm.
dc.publisher	Thesis (Ph.D.) - Bogazici University. Institute for Graduate Studies in Science and Engineering, 2020.
dc.subject.lcsh	Natural language processing (Computer science)
dc.title	Utilizing weakly-supervised learning for hashtag segmentation and named entity disambiguation
dc.format.pages	xxv, 178 leaves ;