A web server for identifying electron transport proteins using word embeddings
Living organisms obtain the energy they need directly from cellular respiration. Within cellular respiration, electron storage and transport are carried out with the aid of electron transport chains, so identifying electron transport proteins is an essential task. When identifying proteins, classification performance depends directly on the choice of feature extraction method and machine learning algorithm. In this study, protein sequences are treated as natural language sentences made up of words. The proposed word embedding-based feature sets, built on word embeddings and protein motif frequencies, proved useful for feature selection. Five word embedding types and a variety of combined feature sets were examined during feature selection. A support vector machine was then used to identify electron transport proteins.
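The idea of reading a protein sequence as a sentence of words can be sketched as follows. This is only an illustration: the word length `k`, the choice between overlapping and non-overlapping words, and the function name `sequence_to_words` are assumptions, not details given by the study.

```python
def sequence_to_words(sequence: str, k: int = 3, overlapping: bool = True) -> list[str]:
    """Split a protein sequence into k-residue 'words' (k-mers).

    k and the overlapping flag are illustrative assumptions; the study
    does not state how its words were defined.
    """
    sequence = sequence.strip().upper()
    # Overlapping words slide one residue at a time; otherwise jump by k.
    step = 1 if overlapping else k
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, step)]


if __name__ == "__main__":
    # First eight residues of the NDHM_ORYSI example sequence.
    print(sequence_to_words("MATTASPF", k=3))
    print(sequence_to_words("MATTASPF", k=3, overlapping=False))
```

Each resulting word list could then be passed to a word embedding model in the same way a tokenized sentence would be.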
Figure 1: The process of cellular respiration, in which ATP molecules, the energy source of cells, are produced
Figure 2: The flowchart of this study
In 5-fold cross-validation, the models achieved an average accuracy, specificity, sensitivity, and MCC of 98.46%, 99.36%, 95.26%, and 0.955, respectively. On the independent test set, the corresponding values were 96.82%, 97.16%, 95.76%, and 0.9. Compared with state-of-the-art predictors, the proposed method performs better on all metrics. These figures indicate that the proposed classification model is effective at identifying electron transport proteins. Furthermore, this study provides a basis for future research that extends the use of natural language processing techniques in bioinformatics.
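For readers who want to evaluate their own model with the same four metrics, the sketch below computes them from binary confusion-matrix counts using the standard definitions. The function name `binary_metrics` and the counts in the usage example are hypothetical and are not taken from this study.

```python
import math


def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> dict[str, float]:
    """Accuracy, sensitivity, specificity and MCC from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    # Sensitivity: fraction of true positives that were recovered.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of true negatives that were recovered.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Matthews correlation coefficient; defined as 0 when the denominator is 0.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    for name, value in binary_metrics(tp=90, tn=80, fp=20, fn=10).items():
        print(f"{name}: {value:.4f}")
```

MCC is reported alongside accuracy because the two classes are imbalanced, and MCC takes all four confusion-matrix cells into account.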
The dataset used by this server was retrieved from UniProt. Details of the dataset are listed in the table below.
Number of proteins per class:
Class name | Original | After 30% similarity check and preprocessing | Train dataset | Test dataset |
Electron transport | 12,832 | 1,324 | 1,091 | 208 |
General transport | 10,814 | 4,569 | 3,846 | 713 |
If you would like to build your own model or evaluate ours, the dataset is available at the link below.
Download dataset.zip
To avoid errors, please submit sequences in FASTA format (we also provide example FASTA files). There are two ways to submit: paste the sequences into the text area, or upload a sequence file. You can submit a single FASTA file or multiple FASTA files. The result page shows, for each sequence, the probability that it is an electron transport protein.
>sp|A2XVZ1|NDHM_ORYSI NAD(P)H-quinone oxidoreductase subunit M, chloroplastic OS=Oryza sativa subsp. indica OX=39946 GN=ndhM PE=3 SV=1
MATTASPFLSPAKLSLERRLPRATWTARRSVRFPPVRAQDQQQQVKEEEEEAAVENLPPP
PQEEEQRRERKTRRQGPAQPLPVQPLAESKNMSREYGGQWLSCTTRHIRIYAAYINPETN
AFDQTQMDKLTLLLDPTDEFVWTDETCQKVYDEFQDLVDHYEGAELSEYTLRLIGSDLEH
FIRKLLYDGEIKYNMMSRVLNFSMGKPRIKFNSSQIPDVK
>sp|P31039|SDHA_BOVIN Succinate dehydrogenase [ubiquinone] flavoprotein subunit, mitochondrial OS=Bos taurus OX=9913 GN=SDHA PE=1 SV=3
MSGVAAVSRLWRARRLALTCTKWSAAWQTGTRSFHFTVDGNKRSSAKVSDAISAQYPVVD
HEFDAVVVGAGGAGLRAAFGLSEAGFNTACVTKLFPTRSHTVAAQGGINAALGNMEEDNW
RWHFYDTVKGSDWLGDQDAIHYMTEQAPASVVELENYGMPFSRTEDGKIYQRAFGGQSLK
FGKGGQAHRCCCVADRTGHSLLHTLYGRSLRYDTSYFVEYFALDLLMESGECRGVIALCI
EDGSIHRIRARNTVIATGGYGRTYFSCTSAHTSTGDGTAMVTRAGLPCQDLEFVQFHPTG
IYGAGCLITEGCRGEGGILINSQGERFMERYAPVAKDLASRDVVSRSMTLEIREGRGCGP
EKDHVYLQLHHLPPAQLAMRLPGISETAMIFAGVDVTKEPIPVLPTVHYNMGGIPTNYKG
QVLRHVNGQDQGVPGLYACGEAACASVHGANRLGANSLLDLVVFGRACALSIAESCRPGD
KVPSIKPNAGEESVMNLDKLRFANGSIRTSELRLNMQKSMQSHAAVFRVGSVLQEGCEKI
SSLYGDLRHLKTFDRGMVWNTDLVETLELQNLMLCALQTIYGAEARKESRGGPRREDFKE
RVDEYDYSKPIQGQQKKPFEQHWRKHTLSYVDIKTGKVTLEYRPVIDRTLNETDCATVPP
AIGSY
>sp|O31214|UCRI_ALLVD Ubiquinol-cytochrome c reductase iron-sulfur subunit OS=Allochromatium vinosum (strain ATCC 17899 / DSM 180 / NBRC 103801 / NCIMB 10441 / D) OX=572477 GN=petA PE=3 SV=2
MLASAGGYWPMSAQGVNKMRRRVLVAATSVVGAVGAGYALVPFVASMNPSARARAAGAPV
EADISKLEPGALLRVKWRGKPVWVVHRSPEMLAALSSNDPKLVDPTSEVPQQPDYCKNPT
RSIKPEYLVAIGICTHLGCSPTYRPEFGPDDLGSDWKGGFHCPCHGSRFDLAARVFKNVP
APTNLVIPKHVYLNDTTILIGEDRGSA
>sp|E0TW67|QOX2_BACPZ Quinol oxidase subunit 2 OS=Bacillus subtilis subsp. spizizenii (strain ATCC 23059 / NRRL B-14472 / W23) OX=655816 GN=qoxA PE=1 SV=2
MIFLFRALKPLLVLALLTVVFVLGGCSNASVLDPKGPVAEQQSDLILLSIGFMLFIVGVV
FVLFTIILVKYRDRKGKDNGSYNPKIHGNTFLEVVWTVIPILIVIALSVPTVQTIYSLEK
APEATKDKEPLVVHATSVDWKWVFSYPEQDIETVNYLNIPVDRPILFKISSADSMASLWI
PQLGGQKYAMAGMLMDQYLQADEVGTYQGRNANFTGEHFADQEFDVNAVTEKDFNSWVKK
TQNEAPKLTKEKYDQLMLPENVDELTFSSTHLKYVDHGQDAEYAMEARKRLGYQAVSPHS
KTDPFENVKENEFKKSDDTEE
>sp|A5GCQ9|VATE_GEOUR V-type ATP synthase subunit E OS=Geobacter uraniireducens (strain Rf4) OX=351605 GN=atpE PE=3 SV=1
MGYVELIAALRRDGEEQLEKIRSDAEREAERVKGDASARIERLRAEYAERLASLEAAQAR
AILADAESKASSIRLATESALAVRLFLLARSSLHHLRDEGYEQLFADLVRELPPGEWRRV
VVNPADMALAARHFPNAEIVSHPAIVGGLEVSEEGGSISVVNTLEKRMERAWPELLPEIL
RDIYREL
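Before submitting, it may help to check locally that a file is valid FASTA. The sketch below is a minimal parser written for illustration; the function name `parse_fasta` is hypothetical and this is not the server's own input handling. It assumes each record starts with a `>` header line followed by one or more sequence lines.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Parse FASTA text into a {header: sequence} mapping.

    Sequence lines that follow a header are concatenated; blank lines
    are ignored. Raises ValueError if sequence data precedes any header.
    """
    records: dict[str, str] = {}
    header = None
    chunks: list[str] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header line")
        else:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


if __name__ == "__main__":
    # Short made-up records, for illustration only.
    demo = ">seq1 demo\nMATT\nASPF\n>seq2\nMLAS\n"
    for name, seq in parse_fasta(demo).items():
        print(name, len(seq))
```

A file that raises `ValueError` here, or yields an empty mapping, is likely to cause an error on submission as well.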
Department of Computer Science and Engineering
Yuan Ze University
135 Yuan-Tung Road, Chung-Li, Taiwan 32003, R.O.C.
Professional Master Program in Artificial Intelligence in Medicine
Taipei Medical University
Taipei City 106, Taiwan
Department of Statistics – Informatics
University of Economics, University of Danang
71 Ngu Hanh Son St, Danang, Vietnam 550000
If you have any problems with our website or suggestions for improving it, feel free to contact us via email: [email protected]