Abstract:
The rapid growth of the biomedical literature makes manual extraction of information about Protein-Protein Interactions (PPIs) an exhausting task, so there is a strong need for techniques that automatically extract relations from scientific publications. In this study, we introduce a novel two-stage cascaded system to extract PPIs from biomedical text. In the first stage, we use a transformer-based model, BioBERT, to determine whether pairs of proteins appearing in a sentence interact with each other; that is, we perform binary relation extraction. In the second stage, we adopt a Generative Adversarial Network (GAN) model, which consists of two contesting neural networks, to eliminate false-positive predictions of the first stage. We evaluate both stages separately on five benchmark PPI corpora: AIMed, BioInfer, HPRD50, IEPA, and LLL. We then combine the five corpora into a single source to examine system performance on a general PPI corpus. Finally, we apply our system to a case study on Host-Pathogen Interaction extraction from the COVID-19 literature. The experimental results show that our first stage achieves a state-of-the-art F1-score of 79.0% on the AIMed corpus and obtains results comparable to previous studies on the other four corpora. Moreover, the second-stage results reveal that the GAN model improves the first-stage results when our BioBERT model is trained on the combined corpus. Our case study results demonstrate that the proposed system can be useful as a real-world application.
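
To make the cascade concrete, the following is a minimal Python sketch of the control flow only: every pair of protein mentions in a sentence is a candidate, the first stage labels each candidate as interacting or not, and the second stage discards first-stage positives it judges to be false. The names `candidate_pairs`, `cascade`, `toy_stage_one`, and `toy_stage_two` are hypothetical, and the two toy classifiers are simple string rules standing in for the BioBERT and GAN models; they are assumptions for illustration and do not reflect the actual models.

```python
from itertools import combinations
from typing import Callable, List, Tuple

Pair = Tuple[str, str]
# A classifier maps (sentence, protein pair) to True when it predicts an interaction.
Classifier = Callable[[str, Pair], bool]


def candidate_pairs(proteins: List[str]) -> List[Pair]:
    """Enumerate every unordered pair of protein mentions in one sentence."""
    return list(combinations(proteins, 2))


def cascade(sentence: str, proteins: List[str],
            stage_one: Classifier, stage_two: Classifier) -> List[Pair]:
    """Keep the pairs accepted by stage one and not rejected by stage two."""
    positives = [p for p in candidate_pairs(proteins) if stage_one(sentence, p)]
    return [p for p in positives if stage_two(sentence, p)]


# Hypothetical stand-in for the first-stage binary classifier:
# accepts every pair in a sentence that contains an interaction verb.
def toy_stage_one(sentence: str, pair: Pair) -> bool:
    return "binds" in sentence


# Hypothetical stand-in for the second-stage filter:
# rejects a pair when the text between the two mentions contains "not".
def toy_stage_two(sentence: str, pair: Pair) -> bool:
    start = sentence.index(pair[0]) + len(pair[0])
    end = sentence.index(pair[1])
    return "not" not in sentence[start:end]


sentence = "ProtA binds ProtB but not ProtC."
proteins = ["ProtA", "ProtB", "ProtC"]
print(cascade(sentence, proteins, toy_stage_one, toy_stage_two))
# prints [('ProtA', 'ProtB')]
```

In this toy run the first stage accepts all three candidate pairs and the second stage removes the two that span the negation, which mirrors the intended role of the second stage as a false-positive filter.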