Abstract:
This thesis proposes a machine learning- and rule-based system for identifying adverse drug reaction (ADR) entity mentions in the text of drug labels and normalizing them to terms in the MedDRA dictionary. The machine learning approach is based on a recently proposed deep learning model that operates at the sentence level. Each token is represented by combining its pre-trained word embedding with a Convolutional Neural Network (CNN) embedding generated from the token's characters. These token representations are first passed through bi-directional Long Short-Term Memory (Bi-LSTM) layers for feature extraction. A Conditional Random Fields (CRF) classifier is then trained on the extracted features to predict the target mentions. The rule-based approach, used for normalizing the identified ADR mentions to MedDRA terms, is an extension of the text-mining system SciMiner. The proposed system is evaluated on the TAC-ADR 2017 challenge dataset. Because this dataset contains disjoint and overlapping mentions, the model also uses a recently proposed chunking scheme designed to handle those mention types. The model obtained an F-score of 76.97 on the TAC dataset. This is lower than the performance of models trained on generic newspaper text; contributing factors include the small size of the training dataset and the uneven distribution of class instances.
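As an illustration of the reported measure, the following minimal sketch computes precision, recall, and F-score over sets of mentions. It assumes exact-match scoring of (start, end, label) spans, which the abstract does not specify; the function name `f_score` and the example spans are hypothetical and are not taken from the TAC-ADR 2017 evaluation.

```python
def f_score(gold, predicted):
    """Return (precision, recall, F1) for exact-match mention sets.

    Assumption: a predicted mention counts as correct only if the same
    (start, end, label) triple appears in the gold annotations.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical mentions as (start_offset, end_offset, label) triples.
gold = {(0, 8, "ADR"), (15, 23, "ADR"), (40, 52, "ADR")}
predicted = {(0, 8, "ADR"), (15, 23, "ADR"), (60, 66, "ADR")}

precision, recall, f1 = f_score(gold, predicted)
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```

Under this assumption, two of the three predicted mentions match the gold annotations, so precision and recall are equal. A score reported on a 0 to 100 scale, such as the one above, would correspond to the F1 value multiplied by 100.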