Text normalization using lexical and contextual features

Uluşahin Sönmez, Çağıl.


dc.contributor	Graduate Program in Computer Engineering.
dc.contributor.advisor	Özgür, Arzucan.
dc.contributor.author	Uluşahin Sönmez, Çağıl.
dc.date.accessioned	2023-03-16T10:01:46Z
dc.date.available	2023-03-16T10:01:46Z
dc.date.issued	2014.
dc.identifier.other	CMPE 2014 U68
dc.identifier.uri	http://digitalarchive.boun.edu.tr/handle/123456789/12265
dc.description.abstract	The informal nature of social media text makes it very difficult for natural language processing tools to process automatically. Text normalization, which restores noisy words to their canonical forms, provides a solution to this challenge. We introduce an unsupervised text normalization approach that utilizes not only lexical, but also contextual and grammatical features of social text. The contextual and grammatical features are extracted from a word association graph built from a large unlabeled social media text corpus. The graph encodes the relative positions of the words with respect to each other, as well as their part-of-speech tags. The lexical features are obtained by using the longest common subsequence ratio and edit distance measures to encode the surface similarity among words, and the double metaphone algorithm to represent the phonetic similarity. Unlike most recent approaches, which are based on generating normalization dictionaries, the proposed approach performs normalization by considering the context of the noisy words in the input text. Our results show that it achieves state-of-the-art F-score performance on a standard data set. In addition, the system can be tuned to achieve very high precision without sacrificing much recall.
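The two surface-similarity measures named in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names are hypothetical, the ratio is assumed to be normalized by the length of the longer word (a common convention), and the edit distance is assumed to be plain Levenshtein distance with unit costs. The double metaphone step is omitted because it needs a phonetic encoding library, and this record does not say how the thesis combines the features.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_ratio(a: str, b: str) -> float:
    """LCS length divided by the longer word's length (assumed normalization)."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit insert, delete, and substitute costs."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # delete ch_a
                            curr[j - 1] + 1,      # insert ch_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    noisy, candidate = "tmrw", "tomorrow"
    print(lcs_ratio(noisy, candidate), edit_distance(noisy, candidate))
```

For the noisy token "tmrw" and the candidate "tomorrow", every letter of the noisy form appears in order in the candidate, so the ratio is high relative to the four-character edit distance. This is the kind of surface evidence that a normalizer could weigh alongside phonetic and contextual features.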
dc.format.extent	30 cm.
dc.publisher	Thesis (M.S.) - Bogazici University. Institute for Graduate Studies in Science and Engineering, 2014.
dc.subject.lcsh	Text processing (Computer science)
dc.title	Text normalization using lexical and contextual features
dc.format.pages	xi, 39 leaves ;