A web server for using Transfer Learning with Pretrained BERT Language Representation Models to Classify Family of Glucose Transporters
Glucose is the essential substance that provides energy for the human body to perform various activities smoothly, and the carbohydrates present in various foods are its main dietary source. Besides serving as a vital source of energy in mammalian cells, glucose plays an important role in cell signaling and acts as a precursor for biomolecule synthesis. Interestingly, the human brain consumes approximately 20% of the body's oxygen and 25% of its glucose, while representing only about 2% of the total body mass of an adult. In addition, a continuous supply of energy is an absolute requirement for maintaining cell growth and life. Therefore, the movement of glucose across the lipid bilayer is a fundamental process for metabolism, development, growth and homeostasis. In humans, three types of glucose transporters have been identified: the facilitative glucose transporters (GLUTs), the sodium-glucose cotransporters (SGLTs) and the Sugars Will Eventually be Exported Transporters (SWEETs).
Recently, deep learning has become the dominant approach and has already been applied in many fields, such as natural language processing (NLP), computer vision, bioinformatics and computational biology, to accomplish various tasks. A promising line of work uses pre-trained language models to produce embedding vectors for classifying protein sequences. In this study, we used BERT models to generate features for input protein sequences and applied different BERT models with an SVM classifier to predict the three types of glucose transport proteins. Our pipeline consists of several steps: data collection, feature generation, feature standardization, handling of the class-imbalance problem, and classification.
We gathered protein sequences for GLUT, SGLT and SWEET from the National Center for Biotechnology Information (NCBI) database.
Table 1. Dataset statistics for the three glucose transporter families.

| Datasets | Original Data | Similarity < 40% (CD-HIT) | Benchmark Testing Data | Benchmark Training Data |
|---|---|---|---|---|
| GLUT | 9616 | 510 | 102 | 408 |
| SGLT | 4107 | 225 | 45 | 180 |
| SWEET | 2026 | 190 | 38 | 152 |
| Total Proteins | 15749 | 925 | 185 | 740 |
As a preprocessing step, we removed protein sequences having more than 40% similarity using the CD-HIT program. The resulting dataset was then divided into training and testing sets for the three types of glucose transporters, as sketched below.
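The following minimal sketch illustrates this preprocessing step, assuming the CD-HIT command-line tool is installed and using scikit-learn for the 80/20 train/test split; all file names are illustrative.

```python
# Sketch of the redundancy-removal and splitting step (file names are illustrative).
import subprocess
from sklearn.model_selection import train_test_split

# Cluster sequences at a 40% identity threshold with CD-HIT and keep one
# representative per cluster (a -c of 0.4 requires word size -n 2).
subprocess.run(
    ["cd-hit", "-i", "glut_raw.fasta", "-o", "glut_nr40.fasta", "-c", "0.4", "-n", "2"],
    check=True,
)

# Read the non-redundant FASTA records and split them 80/20 into the
# benchmark training and testing sets (e.g., 408 / 102 sequences for GLUT).
with open("glut_nr40.fasta") as handle:
    records = [">" + rec for rec in handle.read().split(">") if rec.strip()]

train_records, test_records = train_test_split(records, test_size=0.2, random_state=42)
print(len(train_records), "training,", len(test_records), "testing")
```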
We generated a one-gram (single amino acid) token sequence from the first 510 residues of each protein. BERT appends two special tokens ([CLS] and [SEP]) at the start and end of each sequence, so a 510-residue input fits within BERT's 512-token limit. Given such an input, BERT outputs a vector for each amino acid with a length of 768 for BERT-Base and 1024 for BERT-Large; for each residue, BERT-Base produces 12 such hidden-layer vectors (24 for BERT-Large). We utilized the last hidden layer as the feature representation for our protein sequences. The final feature vector for a protein was obtained by summing the per-residue vectors and dividing by the sequence length.
We applied contextualized word embeddings via BERT because they capture the semantics of words and are sensitive to the context in which the words appear in a sentence.
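A minimal sketch of this feature-extraction step is shown below, using the Hugging Face transformers implementation of BERT-Base (Uncased) as a stand-in for the models listed in Table 2; the model name and tokenization details are assumptions for illustration.

```python
# Sketch of extracting a fixed-length feature vector from a protein sequence
# with BERT (Hugging Face `transformers`); details are illustrative.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def protein_embedding(sequence, max_residues=510):
    # One-gram encoding: each amino acid is a separate token; keeping only the
    # first 510 residues leaves room for the [CLS] and [SEP] special tokens.
    text = " ".join(sequence[:max_residues])
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    hidden = outputs.last_hidden_state.squeeze(0)   # (tokens, 768) for BERT-Base
    residues = hidden[1:-1]                         # drop [CLS] and [SEP]
    # Mean pooling: sum the per-residue vectors and divide by the sequence length.
    return residues.mean(dim=0).numpy()

vector = protein_embedding("MEPSSKKLTGRLMLAVGGAVLGSLQFGYNTGVINAPQKVIEEFYNQTW")
print(vector.shape)  # (768,)
```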
A support vector machine (SVM) with a cost value of 10, a gamma value of 0.0001 and a radial basis function (RBF) kernel was used as the underlying classifier in all experiments in this study, because it outperformed the other traditional machine-learning classifiers we evaluated.
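As an illustration, these classifier settings correspond to the following scikit-learn sketch; the feature arrays here are random placeholders standing in for the pooled BERT embeddings.

```python
# Sketch of the classification step with the SVM settings reported above.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder features: in practice these are the 768-dimensional pooled BERT
# embeddings for the benchmark training (740) and testing (185) sequences.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(740, 768)), rng.integers(0, 3, 740)
X_test, y_test = rng.normal(size=(185, 768)), rng.integers(0, 3, 185)

# Feature standardization followed by an RBF-kernel SVM (C=10, gamma=0.0001).
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

clf = SVC(kernel="rbf", C=10, gamma=0.0001)
clf.fit(X_train_std, y_train)
print("accuracy:", clf.score(X_test_std, y_test))
```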
If you would like to build and evaluate your own model, the dataset is available at the link below.
Download dataset.zip

Google has released several BERT models, which are enhancements and improvements over the basic pre-processing code. In our study, we used four types of BERT models and also developed our own BERT model using the configuration of BERT-Base (Uncased). The details of these models are given in Table 2.
Table 2. Pre-trained BERT models used in this study.

| # | Model | Configurations | Link |
|---|---|---|---|
| 1 | BERT-Base (Uncased) | 12-layer, 768-hidden, 12-heads, 110M parameters | Download from Google |
| 2 | BERT-Large (Uncased) | 24-layer, 1024-hidden, 16-heads, 340M parameters | Download from Google |
| 3 | BERT-Base (Cased) | 12-layer, 768-hidden, 12-heads, 110M parameters | Download from Google |
| 4 | BERT-Large (Cased) | 24-layer, 1024-hidden, 16-heads, 340M parameters | Download from Google |
To avoid errors, please submit sequences in FASTA format (we also provide example FASTA files). Sequences can be submitted in two ways: by pasting them into the text area or by uploading a sequence file, and either a single FASTA sequence or multiple sequences may be submitted. The results page shows, for each submitted sequence, whether or not it is predicted to be a transport protein.
>O68460
MAGIYLFVVAAALAALGYGALTIKTIMAADAGTARMQEISGAVQEGASAFLNRQYKTIAV
VGAVVFVILTALLGISVGFGFLIGAVCSGIAGYVGMYISVRANVRVAAGAQQGLARGLEL
AFQSGAVTGMLVAGLALLSVAFYYILLVGIGATGRALIDPLVALGFGASLISIFARLGGG
IFTKGADVGADLVGKVEAGIPEDDPRNPAVIADNVGDNVGDCAGMAADLFETYAVTVVAT
MVLASIFFAGVPAMTSMMAYPLAIGGVCILASILGTKFVKLGPKNNIMGALYRGFLVSAG
ASFVGIILATAIVPGFGDIQGANGVLYSGFDLFLCAVIGLLVTGLLIWVTEYYTGTNFRP
VRSVAKASTTGHGTNVIQGLAISMEATALPALIICAAIITTYQLSGLFGIAITVTSMLAL
AGMVVALDAYGPVTDNAGGIAEMANLPEDVRKTTDALDAVGNTTKAVTKGYAIGSAGLGA
LVLFAAYTEDLAFFKANVDAYPAFAGVDVNFSLSSPYVVVGLFIGGLLPYLFGSMGMTAV
GRAAGSVVEEVRRQFREIPGIMEGTAKPEYGRCVDMLTKAAIKEMIIPSLLPVLAPIVLY
FVILGIADKSAAFSALGAMLLGVIVTGLFVAISMTAGGGAWDNAKKYIEDGHYGGKGSEA
HKAAVTGDTVGDPYKDTAGPAVNPMIKITNIVALLLLAVLAH
>O06342
MFPAAVGVLWQSGLRDPTPPGGPHGIEGLSLAFEKPSPVTALTQELRFATTMTGGVSLAI
WMAGVTREINLLAQASQWRRLGGTFPTNSQLTNESAASLRLYAQLIDLLDMVVDVDILSG
TSAGGINAALLASSRVTGSDLGGIRDLWLDLGALTELLRDPRDKKTPSLLYGDERIFAAL
AKRLPKLATGPFPPTTFPEAARTPSTTLYITTTLLAGETSRFTDSFGTLVQDVDLRGLFT
FTETDLARPDTAPALALAARSSASFPLAFEPSFLPFTKGTAKKGEVPARPAMAPFTSLTR
PHWVSDGGLLDNRPIGVLFKRIFDRPARRPVRRVLLFVVPSSGPAPDPMHEPPPDNVDEP
LGLIDGLLKGLAAVTTQSIAADLRAIRAHQDCMEARTDAKLRLAELAATLRNGTRLLTPS
LLTDYRTREATKQAQTLTSALLRRLSTCPPESGPATESLPKSWSAELTVGGDADKVCRQQ
ITATILLSWSQPTAQPLPQSPAELARFGQPAYDLAKGCALTVIRAAFQLARSDADIAALA
EVTEAIHRAWRPTASSDLSVLVRTMCSRPAIRQGSLENAADQLAADYLQQSTVPGDAWER
LGAALVNAYPTLTQLAASASADSGAPTDSLLARDHVAAGQLETYLSYLGTYPGRADDSRD
APTMAWKLFDLATTQRAMLPADAEIEQGLELVQVSADTRSLLAPDWQTAQQKLTGMRLHH
FGAFYKRSWRANDWMWGRLDGAGWLVHVLLDPRRVRWIVGERADTNGPQSGAQWFLGKLK
ELGAPDFPSPGYPLPAVGGGPAQHLTEDMLLDELGFLDDPAKPLPASIPWTALWLSQAWQ
QRVLEEELDGLANTVLDPQPGKLPDWSPTSSRTWATKVLAAHPGDAKYALLNENPIAGET
FASDKGSPLMAHTVAKAAATAAGAAGSVRQLPSVLKPPLITLRTLTLSGYRVVSLTKGIA
RSTIIAGALLLVLGVAAAIQSVTVFGVTGLIAAGTGGLLVVLGTWQVSGRLLFALLSFSV
VGAVLALATPVVREWLFGTQQQPGWVGTHAYWLGAQWWHPLVVVGLIALVAIMIAAATPG
RR
>P11166
MEPSSKKLTGRLMLAVGGAVLGSLQFGYNTGVINAPQKVIEEFYNQTWVHRYGESILPTT
LTTLWSLSVAIFSVGGMIGSFSVGLFVNRFGRRNSMLMMNLLAFVSAVLMGFSKLGKSFE
MLILGRFIIGVYCGLTTGFVPMYVGEVSPTALRGALGTLHQLGIVVGILIAQVFGLDSIM
GNKDLWPLLLSIIFIPALLQCIVLPFCPESPRFLLINRNEENRAKSVLKKLRGTADVTHD
LQEMKEESRQMMREKKVTILELFRSPAYRQPILIAVVLQLSQQLSGINAVFYYSTSIFEK
AGVQQPVYATIGSGIVNTAFTVVSLFVVERAGRRTLHLIGLAGMAGCAILMTIALALLEQ
LPWMSYLSIVAIFGFVAFFEVGPGPIPWFIVAELFSQGPRPAAIAVAGFSNWTSNFIVGM
CFQYVEQLCGPYVFIIFTVLLVLFFIFTYFKVPETKGRTFDEIASGFRQGGASQSDKTPE
ELFHPLGADSQV
Department of Computer Science and Engineering
Yuan Ze University
135 Yuan-Tung Road, Chung-Li, Taiwan 32003, R.O.C.
If you have any problems or suggestions for our website, feel free to contact us via email: [email protected]