Abstract:
Protein-ligand interactions play crucial roles in living organisms and therefore attract researchers from many disciplines. Protein-ligand interaction databases provide this information to researchers in a suitable format. These databases are populated by manually extracting interactions from the biomedical literature, but manual extraction is becoming harder as the number of biomedical publications grows, so an automated extraction system is needed. The aim of this thesis is to meet this need with deep learning models. The thesis analyzes the performance of Convolutional Neural Networks (CNN) and Bidirectional Long Short-Term Memory (BiLSTM) networks on the task of protein-ligand interaction extraction, and compares input features in terms of their effect on model performance. The gold standard corpus created for the BioCreative VI ChemProt task is used as the dataset for training and evaluating the models. Word embeddings, distance embeddings, part-of-speech (POS) tags, and inside-outside-beginning (IOB) chunk tags are used as features. Grid search is applied to find the best hyperparameters for each model in the experiments. The best models and input representations are selected on the development set and then evaluated on the test set. Based on the test set results, we conclude that BiLSTM performs better than CNN for every evaluated feature setting.
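The model selection procedure described above (grid search, with the best configuration chosen on the development set) can be illustrated with a minimal sketch. It is not the implementation used in the thesis: the hyperparameter names, the candidate values, and the `toy_dev_score` function are hypothetical stand-ins, and in practice the scoring function would train a CNN or BiLSTM on the training set and return its development set score.

```python
from itertools import product


def grid_search(grid, train_and_score):
    """Evaluate every combination in `grid`; return the best (params, score)."""
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        # In a real run: train on the training set, score on the development set.
        score = train_and_score(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical search space; the names and values are illustrative assumptions.
grid = {
    "hidden_units": [64, 128, 256],
    "dropout": [0.25, 0.5],
    "learning_rate": [1e-3, 1e-4],
}


def toy_dev_score(params):
    # Made-up stand-in for a development set score, with a known optimum
    # so that the sketch runs end to end without training a model.
    return (
        -abs(params["hidden_units"] - 128) / 128
        - abs(params["dropout"] - 0.5)
        - params["learning_rate"]
    )


best_params, best_score = grid_search(grid, toy_dev_score)
print(best_params, best_score)
```

Only the configuration selected this way would then be evaluated once on the test set, which keeps the test set out of the hyperparameter search.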