Abstract:
Over the last decades, semantic text similarity has become a major component of many Natural Language Processing tasks, including text retrieval, summarization, and document categorization. Integrating semantic information is a powerful tool for better understanding and structuring text. Among the many domains that benefit from text mining studies, biomedical literature is one of the most challenging because of its domain-specific language. Given the complex nature of the biomedical literature, domain-specific adaptations are crucial. Several semantic text similarity approaches have been applied at the word level. However, to the best of our knowledge, there has not been any research on sentence-level semantic similarity in the biomedical domain. Furthermore, our experimental results revealed that domain-independent state-of-the-art approaches to sentence-level semantic similarity do not effectively cover biomedical knowledge and produce poor results. In this study, we propose several approaches for domain-specific sentence-level semantic similarity computation, including measures utilizing distributional vector representations of sentences, methods combining general and domain-specific ontologies, and a supervised approach exploiting high-level features. Our proposed methods are evaluated using a manually annotated data set consisting of 100 sentence pairs from biomedical literature. The experiments showed that the supervised semantic similarity computation approach obtained the best performance and improved over the previous domain-independent systems by up to 42.6% in terms of the Pearson correlation metric.
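As a rough illustration of the evaluation setting described above, the following minimal sketch scores sentence pairs with a distributional measure and compares the scores to human annotations using the Pearson correlation. It is not the method proposed in this study: the word vectors in `WORD_VECTORS`, the averaging scheme in `sentence_vector`, and the example sentences are all hypothetical and chosen only to keep the example self-contained.

```python
import math

# Toy word vectors (hypothetical values, for illustration only).
WORD_VECTORS = {
    "protein": [0.9, 0.1, 0.0],
    "kinase": [0.8, 0.2, 0.1],
    "activates": [0.1, 0.9, 0.0],
    "inhibits": [0.0, 0.8, 0.3],
    "pathway": [0.2, 0.1, 0.9],
}
DIM = len(next(iter(WORD_VECTORS.values())))


def sentence_vector(sentence):
    """Average the vectors of known words; unknown words are skipped."""
    vecs = [WORD_VECTORS[w] for w in sentence.lower().split() if w in WORD_VECTORS]
    if not vecs:
        return [0.0] * DIM
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def sentence_similarity(s1, s2):
    """Similarity of two sentences as the cosine of their averaged vectors."""
    return cosine(sentence_vector(s1), sentence_vector(s2))


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0.0 or std_y == 0.0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / (std_x * std_y)


if __name__ == "__main__":
    # Hypothetical sentence pairs with made-up gold similarity scores.
    pairs = [
        ("protein activates pathway", "kinase activates pathway", 3.5),
        ("protein inhibits pathway", "kinase activates protein", 2.0),
        ("protein kinase", "pathway", 0.5),
    ]
    predicted = [sentence_similarity(a, b) for a, b, _ in pairs]
    gold = [score for _, _, score in pairs]
    print("Pearson correlation:", pearson(predicted, gold))
```

In an evaluation of this kind, a higher Pearson correlation means the system's similarity scores track the human judgments more closely; a real system would replace the toy vectors with representations trained on a large corpus.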