Abstract:
Finding high-affinity protein-chemical pairs is a key stage of the drug discovery pipeline. However, the number of available proteins and chemicals forms a combination space too large to explore experimentally, which necessitates computational approaches. Drug-target affinity prediction models address this need by rapidly highlighting high-affinity pairs. This thesis introduces state-of-the-art drug-target affinity prediction models and training strategies to facilitate drug discovery studies. The introduced approaches leverage biomolecular language processing techniques, which interpret chemicals and proteins as documents written in biomolecular languages. The units of biomolecular languages, named biomolecular words, are discovered in large corpora and pharmacologically verified as meaningful substructures. The biomolecular words are used to develop a novel drug-target affinity prediction framework, ChemBoost. ChemBoost models leverage biomolecular word-driven representations and achieve state-of-the-art prediction performance. The experiments also demonstrate that unseen biomolecules challenge all drug-target affinity prediction models, revealing a generalizability problem. To address this problem, a language-inspired model training framework, DebiasedDTA, is introduced. The evaluations indicate that DebiasedDTA boosts model performance on both seen and unseen biomolecules, especially when the target pair is dissimilar to the training biomolecules. ChemBoost and DebiasedDTA are published as an open-source Python package, pydta.