dc.description.abstract |
Rapidly increasing computational power and rapidly advancing sequencing technologies enable the use of high-throughput algorithms to predict the intracellular functions of proteins, one of the most important problems in computational biology. The functions of proteins emerge primarily from their three-dimensional folded structures. When these structures are interpreted as graphs, applying graph neural networks yields promising results. However, such approaches are limited because the three-dimensional folded structures of most proteins are not yet known. The fact that the amino acid sequences of proteins share properties with natural languages, together with the large amount of available sequence data, suggests that these sequences can be processed with natural language processing (NLP) methods. In this thesis, two NLP methods are adapted to the problem of protein function prediction, under the assumption that protein sequence data contain the necessary and sufficient information to predict both the three-dimensional folded structure and the intracellular function: (i) a bidirectional Transformer (BERT) model and (ii) a heterogeneous Graph Convolutional Network (GCN) model. The results show that treating proteins as graphs is more advantageous: the GCN model outperforms the BERT model and achieves performance close to that of the state-of-the-art model that uses three-dimensional folding information. In addition, we find that tokenizing the sequences, instead of using individual amino acids as tokens, increases performance. |
|