Abstract:
Sharing chemical-protein interactions (CPI) with the scientific community plays a crucial role in understanding the mechanisms of diseases, as well as in facilitating drug discovery and drug repurposing studies. A significant amount of knowledge on CPI is published in unstructured documents. The goal of this thesis is to extract relations between chemicals and proteins from the information provided in sentences. For this purpose, we focus on two tasks: (i) binary relation extraction and (ii) multi-class relation extraction from biomedical documents. The aim of the first task is to identify whether or not a sentence states a relation between a pair of biochemicals. The second task extends the first by also identifying the type of the relation between the pair. For both tasks, we develop transformer-based models by utilising the BioBERT and SciBERT architectures. Furthermore, we investigate the effectiveness of different input representation approaches, such as sentence-based and dependency tree-based representations. Our results demonstrate that the BioBERT-based model with the whole-sentence input representation achieves the best performance for both tasks on the benchmark ChemProt test data set, with an F1-score of 77.8% for binary relation extraction and a micro-averaged F1-score of 76.1% for multi-class relation extraction. Interestingly, the significantly shorter dependency tree-based input representations achieve F1-scores close to those of the whole-sentence input representation. Finally, we introduce Vapur, a search engine for protein-chemical interactions extracted from COVID-19-related scientific publications. Vapur shows that our relation extraction models can be used effectively in real-world biomedical applications.