Abstract:
The generation of novel compounds targeting a protein of interest is a compelling task in the pharmaceutical industry. Deep generative models have been applied to targeted molecular design and have shown promising results. However, such models are often limited by the availability of the data they rely on, such as protein structures or protein-ligand binding affinities. Nevertheless, vast amounts of unlabeled protein sequences and chemical compounds are available and have been used to train models that learn useful representations. To transfer this knowledge to targeted drug design, we propose warm-start strategies that initialize models from these pretrained models. We investigate two such strategies: (i) a one-stage strategy, in which the initialized model is trained directly on targeted molecule generation, and (ii) a two-stage strategy, in which pre-finetuning on molecular generation is followed by target-specific training. We also use two decoding strategies to generate compounds: beam search and sampling. The results show that the warm-started models outperform a baseline model trained from scratch across different percentages of the data and across decoding strategies. The two warm-start strategies obtain similar results on widely used benchmark metrics. However, docking evaluation of the generated compounds for a set of novel proteins suggests that the one-stage strategy generalizes better than the two-stage strategy. Additionally, we observe that beam search outperforms sampling both in docking evaluation and in the benchmark metrics that assess compound quality.
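
To make the two decoding strategies concrete, the following is a minimal sketch of beam search and sampling over a toy next-token distribution. It is not the models or code used in this work: `next_token_probs`, the four-symbol vocabulary, and all probabilities are hypothetical stand-ins for a trained decoder conditioned on a protein sequence, and the beam search is a simplified variant in which finished hypotheses are set aside and never pruned.

```python
import math
import random

# Hypothetical toy vocabulary; a real model would emit SMILES tokens.
VOCAB = ["C", "N", "O", "<eos>"]


def next_token_probs(prefix):
    """Hypothetical stand-in for a trained decoder.

    Returns a probability for each token in VOCAB given the prefix.
    The end-of-sequence probability grows with the prefix length.
    """
    p_eos = min(0.9, 0.2 * len(prefix))
    rest = 1.0 - p_eos
    return {"C": 0.6 * rest, "N": 0.25 * rest, "O": 0.15 * rest, "<eos>": p_eos}


def beam_search(beam_width=3, max_len=6):
    """Simplified beam search: keep the beam_width best open prefixes per step."""
    beams = [((), 0.0)]  # (token tuple, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, p in next_token_probs(prefix).items():
                if p <= 0.0:
                    continue  # skip impossible tokens to avoid log(0)
                hyp = (prefix + (tok,), score + math.log(p))
                if tok == "<eos>":
                    finished.append(hyp)
                else:
                    candidates.append(hyp)
        candidates.sort(key=lambda h: h[1], reverse=True)
        beams = candidates[:beam_width]
        if not beams:
            break
    finished.extend(beams)  # hypotheses still open when max_len is reached
    finished.sort(key=lambda h: h[1], reverse=True)
    return finished[:beam_width]


def sample(rng, max_len=6):
    """Ancestral sampling: draw each token from the model's distribution."""
    prefix = ()
    for _ in range(max_len):
        probs = next_token_probs(prefix)
        tokens = list(probs)
        weights = [probs[t] for t in tokens]
        tok = rng.choices(tokens, weights=weights, k=1)[0]
        prefix += (tok,)
        if tok == "<eos>":
            break
    return prefix
```

Calling `beam_search()` returns the highest-scoring hypotheses deterministically, whereas `sample(random.Random(seed))` returns one stochastic draw per call, so repeated sampling trades likelihood for diversity.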