Automated query-biased and structure-preserving document summarization for web search tasks

Pembe, Fatma Canan.

Collection: Boğaziçi University Theses, Institute for Graduate Studies in Science and Engineering (Fen Bilimleri Enstitüsü), Computer Engineering, Ph.D. Theses

URI: http://digitalarchive.boun.edu.tr/handle/123456789/12559

Date: 2010.

Abstract:

With the drastic increase in information sources available on the Internet, people from different backgrounds around the world share the same problem: locating information that is useful for their actual needs. Search engines provide a means for users to locate documents on the Web via queries. However, users still have to perform the sifting process themselves; i.e., they must decide the relevance of each document with respect to their actual information needs. At this point, automatic summarization techniques can complement the task of search engines. Currently available search engines, such as Google and AltaVista, show only a limited capability in summarizing Web documents; e.g., they display only two or three lines of text fragments, consisting of the query words and their surrounding text, as the summary. In the literature, most research in automatic summarization has focused on creating general-purpose summaries without considering user needs. Also, summarization approaches have mostly treated a document as a flat sequence of sentences and ignored the structure within it. The effects of query-biased techniques and of document structure have been considered in only a few studies, and have been investigated separately. This research is distinguished from previous work by combining these two aspects in a coherent framework.

In this thesis, we propose a novel summarization approach for Web search: query-biased and structure-preserving document summarization. The proposed system consists of two main stages. The first stage is the structural processing of Web documents in order to extract their section and subsection hierarchy together with the corresponding headings and subheadings. A document in the system is represented as an ordered tree of headings, subheadings and other text units. First, we developed a rule-based approach based on heuristics and HTML Document Object Model tree processing. Then, we developed a machine learning approach based on the tree representation using support vector machine (SVM) and perceptron algorithms. The methods were evaluated on the accuracy of heading extraction and hierarchy extraction.

The second stage of the research is to develop automatic summarization methods that utilize the document structures obtained in the first stage. In the proposed method, the summary sentences are extracted in a query-biased way based on two levels of scoring: sentence scoring and section scoring. Document structure is utilized both in the summarization process and in the output summaries. The performance of the proposed system was measured using several task-based evaluations, including information retrieval tasks in which the summaries would actually be used. The results of the experiments on Turkish and English documents show that the proposed system's summaries are superior to Google extracts and to unstructured query-biased summaries of the same size in terms of accuracy, with reasonable judgment times. User ratings confirm that users also find query-biased and structure-preserving summaries more useful.
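
The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the abstract does not give the heuristics, the SVM or perceptron features, or the scoring formulas. The sketch assumes that headings are marked with explicit h1 to h6 tags, that both section and sentence scores are simple query-term overlap, and that the two scores are combined with a fixed weight. All names (Section, OutlineParser, parse_outline, summarize, section_weight) are hypothetical.

```python
import re
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import List

# Assumption: headings are explicit <h1>..<h6> tags. The thesis handles
# harder cases with heuristics and learned models not reproduced here.
HEADING_TAGS = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6}


@dataclass
class Section:
    heading: str
    level: int
    text: str = ""
    children: List["Section"] = field(default_factory=list)


class OutlineParser(HTMLParser):
    """Builds an ordered tree of sections from heading tags."""

    def __init__(self):
        super().__init__()
        self.root = Section("", 0)   # level 0 holds text before any heading
        self.stack = [self.root]
        self.in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            level = HEADING_TAGS[tag]
            # Close sections at the same or a deeper level.
            while self.stack[-1].level >= level:
                self.stack.pop()
            node = Section("", level)
            self.stack[-1].children.append(node)
            self.stack.append(node)
            self.in_heading = True

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS:
            self.in_heading = False

    def handle_data(self, data):
        node = self.stack[-1]
        if self.in_heading:
            node.heading += data
        else:
            node.text += data


def parse_outline(html: str) -> Section:
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    return parser.root


def tokens(text: str) -> List[str]:
    return re.findall(r"\w+", text.lower())


def overlap(query_terms: set, text: str) -> float:
    """Fraction of query terms that occur in the text (illustrative score)."""
    if not query_terms:
        return 0.0
    return len(query_terms & set(tokens(text))) / len(query_terms)


def summarize(root: Section, query: str, max_sentences: int = 3,
              section_weight: float = 0.5):
    """Returns (heading path, sentence) pairs in document order."""
    query_terms = set(tokens(query))
    scored = []

    def visit(node: Section, path: List[str]):
        if node.level:
            path = path + [node.heading.strip()]
        section_score = overlap(query_terms, node.heading + " " + node.text)
        for sentence in re.split(r"(?<=[.!?])\s+", node.text.strip()):
            if sentence:
                score = overlap(query_terms, sentence) + section_weight * section_score
                scored.append((score, len(scored), path, sentence))
        for child in node.children:
            visit(child, path)

    visit(root, [])
    top = sorted(scored, key=lambda item: (-item[0], item[1]))[:max_sentences]
    top.sort(key=lambda item: item[1])  # restore document order
    return [(" > ".join(path), sentence) for _, _, path, sentence in top]


SAMPLE_HTML = (
    "<h1>Cats</h1><p>Cats sleep a lot. Cats eat fish.</p>"
    "<h2>Diet</h2><p>Fish is common food. Milk is not ideal.</p>"
    "<h1>Dogs</h1><p>Dogs bark loudly.</p>"
)

for heading_path, sentence in summarize(parse_outline(SAMPLE_HTML), "cats fish"):
    print(f"[{heading_path}] {sentence}")
```

Returning each selected sentence together with its heading path is one simple way to keep the section context visible in the output, which is the sense in which a summary can be called structure-preserving; the thesis may present its summaries differently.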
