dc.description.abstract |
Rapidly increasing computational power and rapidly advancing sequencing technologies enable the use of high-throughput algorithms to predict the intracellular functions of proteins, one of the most important problems in computational biology. The functions of proteins emerge primarily from their three-dimensional folded structures. When these structures are interpreted as graphs, applying graph neural networks yields promising results. However, such approaches are limited because the three-dimensional folded structures of most proteins are not yet known. The fact that the amino acid sequences of proteins share properties with natural languages, together with the large amount of available sequence data, suggests that these sequences can be processed with natural language processing (NLP) methods. In this thesis, two NLP methods are adapted to the problem of protein function prediction, under the assumption that protein sequence data contain the necessary and sufficient information to predict both the three-dimensional folded structure and the intracellular function: (i) a bidirectional Transformer (BERT) model and (ii) a heterogeneous Graph Convolutional Network (GCN) model. The results show that treating proteins as graphs is more advantageous: the GCN model outperforms the BERT model and achieves performance close to that of the state-of-the-art model that uses three-dimensional folding information. In addition, we find that tokenizing the sequences, instead of using individual amino acids as tokens, increases performance. |
|