Abstract:
The increasing difficulty of retrieving relevant information from a rapidly growing literature has heightened interest in natural language processing (NLP) systems for the biomedical domain. In many of these systems, detecting named entities such as diseases, genes, and molecules (named entity recognition) and matching them to the corresponding entries in ontologies (normalization) are important intermediate steps. Because these two tasks are related and the datasets in this domain are relatively small, multi-task learning has frequently been applied to this problem in the literature. Meanwhile, in recent years, the success of transformer-based pre-trained language models such as BERT on various NLP tasks has led to their adoption in the biomedical domain as well. The distinct characteristics of biomedical text, such as abbreviations and domain-specific terminology, have motivated the development of new language models trained specifically on biomedical corpora. In this study, we propose a multi-task learning approach for named entity recognition and normalization that utilizes transformer-based pre-trained language models. To enable optimal sharing of information, both tasks are formulated over text span embeddings produced by a shared encoder network. We obtain promising results on commonly used named entity recognition datasets and compare them with those of state-of-the-art systems from the literature.
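Conceptually, the shared-encoder formulation described above can be sketched as follows. This is a minimal illustration under assumed design choices, not the authors' implementation: the encoder checkpoint, the span representation (concatenated start/end token states), the "no entity" class, and the ontology concept-embedding table are all assumptions made for the example.

```python
# Minimal sketch of a multi-task model sharing one transformer encoder
# between span-based NER and normalization (batch size 1 for simplicity).
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class SpanMultiTaskModel(nn.Module):
    def __init__(self, encoder_name="dmis-lab/biobert-base-cased-v1.1",
                 num_entity_types=5, num_concepts=1000):
        super().__init__()
        # Shared encoder used by both task heads.
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        # Assumed span embedding: concatenation of start/end token states.
        span_dim = 2 * hidden
        # NER head: classifies each candidate span into an entity type,
        # plus one extra class assumed here to mean "not an entity".
        self.ner_head = nn.Linear(span_dim, num_entity_types + 1)
        # Normalization head: a hypothetical table of ontology concept
        # embeddings; spans are matched to concepts by dot product.
        self.concept_embeddings = nn.Embedding(num_concepts, span_dim)

    def span_embeddings(self, hidden_states, spans):
        # spans: (num_spans, 2) tensor of [start, end] token indices.
        starts = hidden_states[0, spans[:, 0]]    # (num_spans, hidden)
        ends = hidden_states[0, spans[:, 1]]      # (num_spans, hidden)
        return torch.cat([starts, ends], dim=-1)  # (num_spans, 2*hidden)

    def forward(self, input_ids, attention_mask, spans):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        span_emb = self.span_embeddings(out.last_hidden_state, spans)
        ner_logits = self.ner_head(span_emb)
        # Score every span against every ontology concept.
        norm_logits = span_emb @ self.concept_embeddings.weight.T
        return ner_logits, norm_logits

tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
model = SpanMultiTaskModel()
enc = tokenizer("BRCA1 mutations increase breast cancer risk.",
                return_tensors="pt")
spans = torch.tensor([[1, 1], [4, 5]])  # illustrative candidate spans
ner_logits, norm_logits = model(enc["input_ids"], enc["attention_mask"], spans)
```

In such a setup, training would combine a cross-entropy loss over the NER logits with one over the normalization logits, so that gradients from both tasks update the shared encoder.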