A web server for using Transfer Learning with Pretrained BERT Language Representation Models to Classify Family of Glucose Transporters
Glucose is the essential substance that provides energy for the human body to perform various activities smoothly, and the carbohydrates present in various foods are its main dietary source. Besides serving as a vital source of energy in mammalian cells, glucose plays an important role in cell signaling and acts as a precursor for biomolecule synthesis. Interestingly, the human brain consumes approximately 20% of the body's oxygen and 25% of its glucose, while representing only about 2% of the total body mass of an adult. In addition, a continuous supply of energy is an absolute requirement for maintaining cell growth and life. Therefore, the movement of glucose across the lipid bilayer is a fundamental process for metabolism, development, growth and homeostasis. In humans, three types of glucose transporters have been identified: the facilitative glucose transporters (GLUTs), the sodium-glucose cotransporters (SGLTs) and the Sugars Will Eventually be Exported Transporters (SWEETs).
Recently, deep learning has become the dominant approach and has already been applied in many fields, such as natural language processing (NLP), computer vision, bioinformatics and computational biology, to accomplish various tasks. A promising line of work uses pre-trained language models to produce embedding vectors for classifying protein sequences. In this study, we used BERT models to generate features for input protein sequences and applied different BERT models with an SVM classifier to predict the three types of glucose transport proteins. Our pipeline consists of several steps: data collection, feature generation, feature standardization, handling of the class-imbalance problem, and classification.
We gathered protein sequences for GLUT, SGLT and SWEET from the National Center for Biotechnology Information (NCBI) database.
Table 1. Dataset statistics for the three glucose transporter families.

| Datasets | Original Data | Similarity < 40% (CD-HIT) | Benchmark Testing Data | Benchmark Training Data |
|---|---|---|---|---|
| GLUT | 9616 | 510 | 102 | 408 |
| SGLT | 4107 | 225 | 45 | 180 |
| SWEET | 2026 | 190 | 38 | 152 |
| Total Proteins | 15749 | 925 | 185 | 740 |
As a preprocessing step, we removed protein sequences having more than 40% similarity using the CD-HIT program. The resulting dataset was then divided into training and testing sets for the three types of glucose transporters, as sketched below.
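The following minimal sketch illustrates this preprocessing step, assuming the CD-HIT command-line tool is installed and using scikit-learn for the 80/20 train/test split; all file names are illustrative.

```python
# Sketch of the redundancy-removal and splitting step (file names are illustrative).
import subprocess
from sklearn.model_selection import train_test_split

# Cluster sequences at a 40% identity threshold with CD-HIT and keep one
# representative per cluster (a -c of 0.4 requires word size -n 2).
subprocess.run(
    ["cd-hit", "-i", "glut_raw.fasta", "-o", "glut_nr40.fasta", "-c", "0.4", "-n", "2"],
    check=True,
)

# Read the non-redundant FASTA records and split them 80/20 into the
# benchmark training and testing sets (e.g., 408 / 102 sequences for GLUT).
with open("glut_nr40.fasta") as handle:
    records = [">" + rec for rec in handle.read().split(">") if rec.strip()]

train_records, test_records = train_test_split(records, test_size=0.2, random_state=42)
print(len(train_records), "training,", len(test_records), "testing")
```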
We generated a one-gram (single amino acid) token sequence from the first 510 residues of each protein. BERT appends two special tokens ([CLS] and [SEP]) at the start and end of each sequence, so a 510-residue input fits within BERT's 512-token limit. Given such an input, BERT outputs a vector for each amino acid with a length of 768 for BERT-Base and 1024 for BERT-Large; for each residue, BERT-Base produces 12 such hidden-layer vectors (24 for BERT-Large). We utilized the last hidden layer as the feature representation for our protein sequences. The final feature vector for a protein was obtained by summing the per-residue vectors and dividing by the sequence length.
We applied contextualized word embeddings via BERT because they capture the semantics of words and are sensitive to the context in which the words appear in a sentence.
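A minimal sketch of this feature-extraction step is shown below, using the Hugging Face transformers implementation of BERT-Base (Uncased) as a stand-in for the models listed in Table 2; the model name and tokenization details are assumptions for illustration.

```python
# Sketch of extracting a fixed-length feature vector from a protein sequence
# with BERT (Hugging Face `transformers`); details are illustrative.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def protein_embedding(sequence, max_residues=510):
    # One-gram encoding: each amino acid is a separate token; keeping only the
    # first 510 residues leaves room for the [CLS] and [SEP] special tokens.
    text = " ".join(sequence[:max_residues])
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    hidden = outputs.last_hidden_state.squeeze(0)   # (tokens, 768) for BERT-Base
    residues = hidden[1:-1]                         # drop [CLS] and [SEP]
    # Mean pooling: sum the per-residue vectors and divide by the sequence length.
    return residues.mean(dim=0).numpy()

vector = protein_embedding("MEPSSKKLTGRLMLAVGGAVLGSLQFGYNTGVINAPQKVIEEFYNQTW")
print(vector.shape)  # (768,)
```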
A support vector machine (SVM) with a cost value of 10, a gamma value of 0.0001 and a radial basis function (RBF) kernel was used as the underlying classifier in all experiments in this study, because it outperformed the other traditional machine-learning classifiers we evaluated.
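As an illustration, these classifier settings correspond to the following scikit-learn sketch; the feature arrays here are random placeholders standing in for the pooled BERT embeddings.

```python
# Sketch of the classification step with the SVM settings reported above.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder features: in practice these are the 768-dimensional pooled BERT
# embeddings for the benchmark training (740) and testing (185) sequences.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(740, 768)), rng.integers(0, 3, 740)
X_test, y_test = rng.normal(size=(185, 768)), rng.integers(0, 3, 185)

# Feature standardization followed by an RBF-kernel SVM (C=10, gamma=0.0001).
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

clf = SVC(kernel="rbf", C=10, gamma=0.0001)
clf.fit(X_train_std, y_train)
print("accuracy:", clf.score(X_test_std, y_test))
```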
If you would like to build and evaluate your own model, the dataset is available at the link below.
Download dataset.zip

Google has released several BERT models, which are enhancements and improvements over the basic pre-processing code. In our study, we used four types of BERT models and also developed our own BERT model using the configuration of BERT-Base (Uncased). The details of these models are given in Table 2.
Table 2. Pre-trained BERT models used in this study.

| # | Model | Configurations | Link |
|---|---|---|---|
| 1 | BERT-Base (Uncased) | 12-layer, 768-hidden, 12-heads, 110M parameters | Download from Google |
| 2 | BERT-Large (Uncased) | 24-layer, 1024-hidden, 16-heads, 340M parameters | Download from Google |
| 3 | BERT-Base (Cased) | 12-layer, 768-hidden, 12-heads, 110M parameters | Download from Google |
| 4 | BERT-Large (Cased) | 24-layer, 1024-hidden, 16-heads, 340M parameters | Download from Google |
To avoid errors, please submit sequences in FASTA format (we also provide example FASTA files). Sequences can be submitted in two ways: by pasting them into the text area or by uploading a sequence file, and either a single FASTA sequence or multiple sequences may be submitted. The results page shows, for each submitted sequence, whether or not it is predicted to be a transport protein.
>O68460
MAGIYLFVVAAALAALGYGALTIKTIMAADAGTARMQEISGAVQEGASAFLNRQYKTIAV
VGAVVFVILTALLGISVGFGFLIGAVCSGIAGYVGMYISVRANVRVAAGAQQGLARGLEL
AFQSGAVTGMLVAGLALLSVAFYYILLVGIGATGRALIDPLVALGFGASLISIFARLGGG
IFTKGADVGADLVGKVEAGIPEDDPRNPAVIADNVGDNVGDCAGMAADLFETYAVTVVAT
MVLASIFFAGVPAMTSMMAYPLAIGGVCILASILGTKFVKLGPKNNIMGALYRGFLVSAG
ASFVGIILATAIVPGFGDIQGANGVLYSGFDLFLCAVIGLLVTGLLIWVTEYYTGTNFRP
VRSVAKASTTGHGTNVIQGLAISMEATALPALIICAAIITTYQLSGLFGIAITVTSMLAL
AGMVVALDAYGPVTDNAGGIAEMANLPEDVRKTTDALDAVGNTTKAVTKGYAIGSAGLGA
LVLFAAYTEDLAFFKANVDAYPAFAGVDVNFSLSSPYVVVGLFIGGLLPYLFGSMGMTAV
GRAAGSVVEEVRRQFREIPGIMEGTAKPEYGRCVDMLTKAAIKEMIIPSLLPVLAPIVLY
FVILGIADKSAAFSALGAMLLGVIVTGLFVAISMTAGGGAWDNAKKYIEDGHYGGKGSEA
HKAAVTGDTVGDPYKDTAGPAVNPMIKITNIVALLLLAVLAH
>O06342
MFPAAVGVLWQSGLRDPTPPGGPHGIEGLSLAFEKPSPVTALTQELRFATTMTGGVSLAI
WMAGVTREINLLAQASQWRRLGGTFPTNSQLTNESAASLRLYAQLIDLLDMVVDVDILSG
TSAGGINAALLASSRVTGSDLGGIRDLWLDLGALTELLRDPRDKKTPSLLYGDERIFAAL
AKRLPKLATGPFPPTTFPEAARTPSTTLYITTTLLAGETSRFTDSFGTLVQDVDLRGLFT
FTETDLARPDTAPALALAARSSASFPLAFEPSFLPFTKGTAKKGEVPARPAMAPFTSLTR
PHWVSDGGLLDNRPIGVLFKRIFDRPARRPVRRVLLFVVPSSGPAPDPMHEPPPDNVDEP
LGLIDGLLKGLAAVTTQSIAADLRAIRAHQDCMEARTDAKLRLAELAATLRNGTRLLTPS
LLTDYRTREATKQAQTLTSALLRRLSTCPPESGPATESLPKSWSAELTVGGDADKVCRQQ
ITATILLSWSQPTAQPLPQSPAELARFGQPAYDLAKGCALTVIRAAFQLARSDADIAALA
EVTEAIHRAWRPTASSDLSVLVRTMCSRPAIRQGSLENAADQLAADYLQQSTVPGDAWER
LGAALVNAYPTLTQLAASASADSGAPTDSLLARDHVAAGQLETYLSYLGTYPGRADDSRD
APTMAWKLFDLATTQRAMLPADAEIEQGLELVQVSADTRSLLAPDWQTAQQKLTGMRLHH
FGAFYKRSWRANDWMWGRLDGAGWLVHVLLDPRRVRWIVGERADTNGPQSGAQWFLGKLK
ELGAPDFPSPGYPLPAVGGGPAQHLTEDMLLDELGFLDDPAKPLPASIPWTALWLSQAWQ
QRVLEEELDGLANTVLDPQPGKLPDWSPTSSRTWATKVLAAHPGDAKYALLNENPIAGET
FASDKGSPLMAHTVAKAAATAAGAAGSVRQLPSVLKPPLITLRTLTLSGYRVVSLTKGIA
RSTIIAGALLLVLGVAAAIQSVTVFGVTGLIAAGTGGLLVVLGTWQVSGRLLFALLSFSV
VGAVLALATPVVREWLFGTQQQPGWVGTHAYWLGAQWWHPLVVVGLIALVAIMIAAATPG
RR
>P11166
MEPSSKKLTGRLMLAVGGAVLGSLQFGYNTGVINAPQKVIEEFYNQTWVHRYGESILPTT
LTTLWSLSVAIFSVGGMIGSFSVGLFVNRFGRRNSMLMMNLLAFVSAVLMGFSKLGKSFE
MLILGRFIIGVYCGLTTGFVPMYVGEVSPTALRGALGTLHQLGIVVGILIAQVFGLDSIM
GNKDLWPLLLSIIFIPALLQCIVLPFCPESPRFLLINRNEENRAKSVLKKLRGTADVTHD
LQEMKEESRQMMREKKVTILELFRSPAYRQPILIAVVLQLSQQLSGINAVFYYSTSIFEK
AGVQQPVYATIGSGIVNTAFTVVSLFVVERAGRRTLHLIGLAGMAGCAILMTIALALLEQ
LPWMSYLSIVAIFGFVAFFEVGPGPIPWFIVAELFSQGPRPAAIAVAGFSNWTSNFIVGM
CFQYVEQLCGPYVFIIFTVLLVLFFIFTYFKVPETKGRTFDEIASGFRQGGASQSDKTPE
ELFHPLGADSQV
Department of Computer Science and Engineering
Yuan Ze University
135 Yuan-Tung Road, Chung-Li, Taiwan 32003, R.O.C.
If you have any problems or suggestions for our website, feel free to contact us via email: [email protected]