Archives and Documentation Center
Digital Archives

Utilizing weakly-supervised learning for hashtag segmentation and named entity disambiguation


dc.contributor Ph.D. Program in Computer Engineering.
dc.contributor.advisor Özgür, Arzucan.
dc.contributor.author Çelebi, Arda.
dc.date.accessioned 2023-03-16T10:14:04Z
dc.date.available 2023-03-16T10:14:04Z
dc.date.issued 2020.
dc.identifier.other CMPE 2020 C45 PhD
dc.identifier.uri http://digitalarchive.boun.edu.tr/handle/123456789/12639
dc.description.abstract Today’s high-performing machine learning algorithms learn to predict under the supervision of large amounts of human-labeled data. However, the labeling process is costly in terms of time and effort. In this thesis, we design weakly-supervised approaches, which are based on automatically labeling raw data, for two different Natural Language Processing (NLP) tasks, namely hashtag segmentation and Named Entity Disambiguation (NED). The aim of hashtag segmentation is to identify the words in hashtags so that they can be processed and understood better. We propose a heuristic to obtain automatically segmented hashtags from a large tweet corpus and use these data to train a maximum entropy classifier. State-of-the-art accuracy is achieved for hashtag segmentation without using any manually labeled training data. The goal of NED, the second task that we address, is to link the named entity (NE) mentions in text to their corresponding records in the Knowledge Base. We hypothesize that the types of the NE mentions may provide useful clues for their correct disambiguation. The standard approaches for identifying mention types require a type taxonomy and large amounts of mentions annotated with their types. We propose a cluster-based mention typing approach, which does not require a type taxonomy or labeled mentions. This weakly-supervised approach is based on clustering the NEs in Wikipedia by using different levels of contextual information and automatically generating data for training a mention typing model. The mention type predictions lead to significant F-score improvement when incorporated into a supervised NED model. This thesis shows that designing weakly-supervised approaches by considering the underlying characteristics of the addressed problem can be an effective strategy for NLP.
dc.format.extent 30 cm.
dc.publisher Thesis (Ph.D.) - Bogazici University. Institute for Graduate Studies in Science and Engineering, 2020.
dc.subject.lcsh Natural language processing (Computer science)
dc.title Utilizing weakly-supervised learning for hashtag segmentation and named entity disambiguation
dc.format.pages xxv, 178 leaves ;

