Human Protein Subcellular Localization with Integrated Source and Multi-label Ensemble Classifier

Guo, Xiaotong; Liu, Fulin; Ju, Ying; Wang, Zhen; Wang, Chunyu

doi:10.1038/srep28087

Published: 21 June 2016

Volume 6, article number 28087, (2016)

Scientific Reports



Abstract

Predicting protein subcellular location is necessary for understanding cell function. Because wet-lab experiments are costly and time consuming, several machine learning methods have been developed to predict location computationally from primary protein sequences. However, two problems remain in state-of-the-art methods. First, several proteins appear in different subcellular structures simultaneously, whereas current methods assign each protein sequence to only one subcellular structure. Second, most software tools are trained on obsolete data and miss the latest databases. We propose a novel multi-label classification algorithm to solve the first problem and integrate several recent databases to improve prediction performance. Experiments demonstrate the effectiveness of the proposed method. The present study should facilitate research on cellular proteomics.


Introduction

Cells are highly ordered structures and contain various subcellular compartments that ensure the normal operation of the entire cell. These subcellular structures include nuclei, mitochondria, endoplasmic reticulum, Golgi apparatus, cell membrane and extracellular matrix. The biological functions of a cell are executed by its proteins. A protein synthesized on the ribosome must be transported to its corresponding subcellular structure to perform its normal biological function. If a protein is not localized to its proper position, serious loss of function or disorder occurs in the organism. Researchers have found aberrant protein subcellular localization in some cell lesions (such as cancer cells)¹. The subcellular location of a protein is an important attribute, which is useful in determining protein function, revealing the mechanisms of molecular interaction and understanding complex physiological processes². It is of great significance to cell biology, proteomics and drug design research³.

Using conventional biochemical methods, such as cell separation, electron microscopy and fluorescence microscopy, to determine protein subcellular localization is expensive, time consuming and laborious⁴. In today’s post-genome era, the large number of available protein sequences provides raw material for the development of bioinformatics and a stage for the application of machine learning methods in the life sciences⁵.

The typical protein subcellular location system based on machine learning methods includes the following four basic steps: (1) establishment of protein data set, (2) protein sequence feature extraction, (3) design of multi-label classification algorithm and (4) construction of Web server⁶.

Databases for protein subcellular location include LOCATE⁷, PSORTdb⁸, Arabidopsis Subcellular DB⁹, Yeast Subcellular DB¹⁰, Plant-PLoc¹¹, LOCtarget¹², LOC3D¹³, DBSubloc¹⁴ and PA-GOSUB¹⁵. However, none of the current works on computational protein subcellular localization has integrated these sources; previous works used only part of the available protein sequences for training. In this paper, we collected existing related data sets and integrated them into a more complete data set.

Feature extraction is a key process in various protein classification problems. Feature vectors are sometimes called the fingerprints of proteins. Common features include Chou’s PseAAC representation¹⁶, K-mer and K-skip frequencies¹⁷, Chen’s 188D composition and physicochemical characteristics¹⁸, Wei’s secondary structure features^19,20 and PSSM matrix features²¹. Several web servers have also been developed for feature extraction from protein primary sequences, including Pse-in-one²², Protrweb²³ and PseAAC²⁴.
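As an illustration of what such a fingerprint looks like, the following minimal Python sketch computes normalized K-mer frequencies for a protein sequence. It is not taken from the cited tools; the function name `kmer_frequencies`, the example sequence and the choice to skip windows containing non-standard residues are assumptions made for the example.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kmer_frequencies(sequence, k=2):
    """Return normalized k-mer frequencies as a fixed-length feature vector."""
    sequence = sequence.upper()
    windows = [sequence[i:i + k] for i in range(len(sequence) - k + 1)]
    # Assumption: windows with non-standard residues (e.g. X, B) are skipped.
    valid = [w for w in windows if all(c in AMINO_ACIDS for c in w)]
    counts = Counter(valid)
    total = len(valid)
    # Fixed vocabulary order so every sequence maps to the same dimensions.
    vocabulary = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    return [counts[kmer] / total if total else 0.0 for kmer in vocabulary]


# Hypothetical toy sequence; k=1 gives the 20D amino acid composition.
composition = kmer_frequencies("MKVLAAGIVK", k=1)
dipeptides = kmer_frequencies("MKVLAAGIVK", k=2)
print(len(composition), len(dipeptides))
```

With k = 1 the vector has 20 dimensions and with k = 2 it has 400, which shows how quickly the feature space grows with K.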

A proper classifier can help to improve prediction performance. Support vector machine (SVM), k-nearest neighbor (kNN), artificial neural network, random forest (RF)²⁵ and ensemble learning^26,27 are often employed for special peptide identification. However, subcellular localization of a protein is in essence a multi-label classification problem, which differs from the methods used for identifying cellular factors (multi-class learning). Recently, several multi-label classification methods have been employed for subcellular localization in different species, including human^28,29, plant³⁰, virus^31,32, eukaryote^33,34 and animal³⁵. Features were also extracted according to n-gram³⁶, Chou’s PseAAC representation³⁷ and gene ontology³⁸. These works all focused on feature construction and employed only basic multi-label strategies; most of them adapted SVM to the multi-label setting. We found that advanced ensemble multi-label learning techniques further improve the performance.

Material and Methods

Integration of multiple protein subcellular localization sources

In this section, we reconstruct the training set for the human protein subcellular localization study. The new data set draws on richer sources, and we further reduce its redundancy with CD-HIT³⁹. We also expand the size of the data set, which renders the training data more comprehensive and provides a more convincing basis for the multi-label classification learning step. The training set reconstruction is introduced from two aspects, namely, data sources and data processing. The new data set mainly draws on two sources, LOCATE⁷ and Hum-mPLoc 2.0⁴⁰.

A total of 526 (480+43+3 = 526) protein sequences are recorded as multi-label sequences (no repeat), which have two or more types of subcellular sites (the number of sites P₁ is greater than or equal to 2); they form the data set D_M1. The protein sequence distribution on each subcellular site is shown in Table 1.

Table 1 The protein sequences distribution on 14 subcellular sites.

Full size table

Hum-mPLoc 2.0 covers relatively few subcellular sites, but some of its proteins are located at three or four subcellular sites. Its proteins are therefore rich and varied, which gives the data set a certain advantage in terms of protein function.

From the LOCATE database, we directly obtained the document human.xml, the original XML file on human subcellular localization. The document contains abundant information about human proteins. Our goal is to obtain the amino acid FASTA sequences of 64,637 human proteins and the subcellular sites of these sequences (site number P₂ is greater than or equal to 1). After rigorous data processing, we obtain the reference data set D₂ containing 6776 different protein sequences (no repeat). The 6776 protein sequences are distributed over 37 subcellular structures and possess two subcellular locations at most. Among these sequences, 4066 have only one type of subcellular location and belong to the single-label sequence data set D_S2. The remaining 2710 protein sequences have two subcellular locations (site number P₂ equals 2) and belong to the multi-label sequence data set D_M2. A total of 9486 (4066+2710*2 = 9486) locative protein sequences (a sequence is counted once for each of its locations) correspond to the 37 subcellular locations. The protein sequence distribution on each subcellular site is shown in Table 2.
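The difference between distinct sequences and locative sequences can be checked with a few lines of Python. This is only a worked restatement of the counts reported above; the function name `count_sequences` is chosen for illustration.

```python
def count_sequences(single_label, double_label):
    """Return (distinct, locative) counts for proteins with one or two sites."""
    # Each protein is counted once, regardless of its number of locations.
    distinct = single_label + double_label
    # A protein with two locations is counted once per location.
    locative = single_label + 2 * double_label
    return distinct, locative


distinct, locative = count_sequences(single_label=4066, double_label=2710)
print(distinct, locative)  # 6776 9486
```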

Table 2 The protein sequences distribution on 37 subcellular sites in LOCATE.

Full size table

The results of data processing indicate that the LOCATE database is extremely rich in types of proteins and subcellular sites. However, the number of protein sequences with multiple subcellular sites is relatively small, and none belongs to three or more types of subcellular sites. This finding indicates that the protein data in LOCATE are limited in functional diversity. To compensate for the limitations of the LOCATE database and of Shen’s basic data set (Hum-mPLoc 2.0), we combine the two types of data and reconstruct the basic data sets. By combining Tables 1 and 2, we conclude that the 14 types of subcellular sites in Hum-mPLoc 2.0 are contained entirely in the 37 types of subcellular sites in the LOCATE database, which is conducive to our data set reconstruction.

To demonstrate the necessity of multi-label classification in protein subcellular localization, the performances of multi-label and single-label classifiers must be compared. However, a multi-label data set cannot be used for single-label classifiers. Therefore, the data sets of multi-label and single-label protein sequences were reconstructed separately, although both come from the sources mentioned in the above section. The reconstructed multi-label data set is denoted D_RM and the single-label data set D_RS.

CD-HIT³⁹ is a software tool for reducing the similarity among protein sequences by deleting similar sequences from the data set. Here we required the similarity of each pair of sequences to be less than 40%. Table 3 shows the protein sequences of the reconstructed data set D_R and the subcellular sites.
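A 40% threshold could be requested from CD-HIT roughly as in the sketch below, which only builds the command line and does not run it. The options `-i`, `-o`, `-c` and `-n` and the word-size ranges follow the CD-HIT user guide; the file names are hypothetical, and the exact settings used by the authors are not stated in the text.

```python
def build_cd_hit_command(input_fasta, output_fasta, identity=0.4):
    """Build the argument list for a CD-HIT protein clustering run."""
    if not 0.4 <= identity <= 1.0:
        raise ValueError("cd-hit supports identity thresholds from 0.4 to 1.0")
    # Word size recommended by the CD-HIT user guide for each threshold range.
    if identity >= 0.7:
        word_size = 5
    elif identity >= 0.6:
        word_size = 4
    elif identity >= 0.5:
        word_size = 3
    else:
        word_size = 2
    return ["cd-hit", "-i", input_fasta, "-o", output_fasta,
            "-c", str(identity), "-n", str(word_size)]


# Hypothetical file names, used only for illustration.
command = build_cd_hit_command("human_all.fasta", "human_nr40.fasta")
print(" ".join(command))
```

If CD-HIT is installed, the returned list can be passed to `subprocess.run(command, check=True)`.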

Table 3 Subcellular sites and protein sequences distribution in D_R.

Full size table

Features for subcellular localization

The above section mainly discusses the preprocessing of the data set. The reconstructed data set provides a reliable basis for the study of localization methods. This section focuses on the specific features used for protein subcellular localization based on machine learning.

In this section, three types of feature extraction methods are introduced based on the position-specific scoring matrix (PSSM)⁴¹ and pseudo-amino acid composition (PseAAC)⁴². In the long process of evolution, some characteristic genes are not eliminated but are selectively retained. These characteristics can effectively characterize the corresponding protein. Feature extraction methods based on PSSM compare protein sequences and exploit this invariance. The PSSM represents the comparison results between the input protein sequence and its homologous protein sequences in the Swiss-Prot database. The multiple sequence alignment tools are HAlign⁴³ and PSI-BLAST⁴⁴ (position-specific iterated BLAST). Each input protein sequence generates a PSSM after multiple sequence alignment. The elements of the PSSM characterize the level of homology between the amino acid at a given position of the input protein sequence and the amino acids at the corresponding position of its homologous sequences. A larger element value indicates higher conservation; lower conservation means that the amino acid at that position is prone to mutation. We extracted 20D and 420D features from the PSSM according to different parameters, which are described in detail in the supplementary materials.
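The exact definition of the 20D PSSM feature is given in the supplementary materials and is not reproduced here. As a hedged illustration, one common way to obtain a 20D vector from an L × 20 PSSM is to average each column over the L sequence positions; the sketch below assumes this definition, and the function name `pssm_20d` is hypothetical.

```python
def pssm_20d(pssm):
    """Average each of the 20 PSSM columns over all sequence positions.

    `pssm` is a list of L rows, one per residue, each holding 20 scores.
    Assumption: column averaging is used; the paper's definition may differ.
    """
    if not pssm:
        raise ValueError("PSSM must contain at least one row")
    if any(len(row) != 20 for row in pssm):
        raise ValueError("every PSSM row must contain 20 scores")
    length = len(pssm)
    return [sum(row[j] for row in pssm) / length for j in range(20)]


# Toy PSSM for a two-residue sequence (values are illustrative only).
toy_pssm = [[1] * 20, [3] * 20]
print(pssm_20d(toy_pssm)[:3])
```

The resulting vector has the same length for every protein, which is what allows sequences of different lengths to be fed to the same classifier.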

The purpose of PseAAC is also to improve the accuracy of protein subcellular localization and of membrane protein prediction. We extracted 188D features from PseAAC, including 20D features of amino acid composition, 24D features based on the contents of amino acids with certain physicochemical properties, 24D features of bivalent frequency and 120D features from eight physicochemical properties (20+24+24+120 = 188). They are also described in detail in the supplementary materials.

Multi-label classification ensemble learning method

We employed an ensemble multi-label classification method to improve the prediction performance. To our knowledge, there have been no ensemble methods for multi-label classification in bioinformatics so far. Next we describe the ensemble voting strategies of our method.

Basic classifiers are denoted as C_1, C_2, …, C_p and the labels are denoted as λ_1, λ_2, …, λ_q.

MeanEnsemble algorithm

The prediction result p_ij is the probability that the sample is predicted to have label λ_j by the i-th base classifier (i = 1, …, p; j = 1, …, q), which gives a p × q result matrix. We calculate the average value of each column. Each training sample generates a q-dimensional vector:

$$v = (v_1, v_2, \ldots, v_q), \qquad v_j = \frac{1}{p}\sum_{i=1}^{p} p_{ij}$$

v_j is the probability that the sample belongs to the corresponding class label. If 0.5 ≤ v_j ≤ 1, the sequence belongs to λ_j. If 0 ≤ v_j < 0.5, the sequence does not belong to λ_j.
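The MeanEnsemble rule described above can be sketched in Python as follows. This is a minimal sketch, not the Mulan implementation used in the paper; the function name `mean_ensemble` and the toy probabilities are assumptions.

```python
def mean_ensemble(prob_matrix, threshold=0.5):
    """Average per-label probabilities over base classifiers.

    `prob_matrix` has one row per base classifier and one column per label.
    Returns the averaged scores and a boolean decision for each label.
    """
    n_classifiers = len(prob_matrix)
    n_labels = len(prob_matrix[0])
    scores = [sum(row[j] for row in prob_matrix) / n_classifiers
              for j in range(n_labels)]
    # A label is assigned when its averaged probability reaches the threshold.
    decisions = [score >= threshold for score in scores]
    return scores, decisions


# Three hypothetical base classifiers voting on three labels.
probabilities = [
    [0.9, 0.2, 0.5],
    [0.7, 0.4, 0.7],
    [0.8, 0.1, 0.6],
]
scores, decisions = mean_ensemble(probabilities)
print(decisions)  # [True, False, True]
```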

MajorityVoteEnsemble algorithm

Every basic classifier separately predicts a sample. The prediction result is S, S ∈ {−1, +1}. If S = −1, the sample is recognized as a counterexample by the base classifier; otherwise, it is identified as a positive example. Writing s_ij for the result of the i-th base classifier on label λ_j, we calculate the average value of each column, and each training sample generates a q-dimensional vector:

$$v = (v_1, v_2, \ldots, v_q), \qquad v_j = \frac{1}{p}\sum_{i=1}^{p} s_{ij}$$

If v_j ≥ 0, the sample belongs to λ_j; otherwise, it does not.
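A minimal Python sketch of the MajorityVoteEnsemble rule is given below, assuming each base classifier outputs −1 or +1 for every label. The function name `majority_vote_ensemble` and the example votes are hypothetical; note that a tie (average of exactly 0) is treated as positive, following the rule v_j ≥ 0 stated above.

```python
def majority_vote_ensemble(vote_matrix):
    """Average the -1/+1 votes of the base classifiers for each label.

    `vote_matrix` has one row per base classifier and one column per label.
    """
    n_classifiers = len(vote_matrix)
    n_labels = len(vote_matrix[0])
    scores = [sum(row[j] for row in vote_matrix) / n_classifiers
              for j in range(n_labels)]
    # Non-negative average means at least half of the classifiers voted +1.
    decisions = [score >= 0 for score in scores]
    return scores, decisions


# Four hypothetical base classifiers voting on three labels.
votes = [
    [+1, -1, +1],
    [+1, -1, -1],
    [-1, -1, +1],
    [+1, +1, -1],
]
scores, decisions = majority_vote_ensemble(votes)
print(scores, decisions)  # [0.5, -0.5, 0.0] [True, False, True]
```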

TopKEnsemble algorithm

In each column of the result matrix, the p accuracy values are sorted in descending order and the average of the first K (K is determined by p) accuracy values is calculated to obtain a q-dimensional vector:

$$v = (v_1, v_2, \ldots, v_q), \qquad v_j = \frac{1}{K}\sum_{k=1}^{K} a_{(k)j}$$

where a_(k)j denotes the k-th largest value in column j. If 0.5 ≤ v_j ≤ 1, the sequence belongs to λ_j. If 0 ≤ v_j < 0.5, the sequence does not belong to λ_j.
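The TopKEnsemble rule can be sketched in Python as below. The text states only that K is determined by p and does not give the rule, so K is passed explicitly here; the function name `top_k_ensemble` and the toy values are assumptions.

```python
def top_k_ensemble(score_matrix, k, threshold=0.5):
    """Average the k largest values in each label column.

    `score_matrix` has one row per base classifier and one column per label.
    """
    n_classifiers = len(score_matrix)
    if not 1 <= k <= n_classifiers:
        raise ValueError("k must be between 1 and the number of classifiers")
    n_labels = len(score_matrix[0])
    scores = []
    for j in range(n_labels):
        # Sort the column in descending order and keep the first k values.
        column = sorted((row[j] for row in score_matrix), reverse=True)
        scores.append(sum(column[:k]) / k)
    decisions = [score >= threshold for score in scores]
    return scores, decisions


# Three hypothetical base classifiers, two labels, K = 2.
values = [
    [0.9, 0.1],
    [0.6, 0.2],
    [0.3, 0.6],
]
scores, decisions = top_k_ensemble(values, k=2)
print(decisions)  # [True, False]
```

With K equal to the number of base classifiers, this rule reduces to plain column averaging.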

The workflow of our protein subcellular localization prediction method is shown in Fig. 1. In the data part, two sources of protein subcellular localization information were integrated. Then we tried three kinds of common features for representing the protein sequences. A multi-label classifier was employed for the prediction. The implementation was done with Mulan⁴⁵, which is an open source machine learning software tool.

Evaluation criteria and measurement

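Average precision (AP)⁴⁶: AP refers to the average accuracy of multi-label classification. This index is positively related to multi-label classification system performance. If AP = 1, the classification effect is the best. The calculation formula of AP is as follows:

$$\mathrm{AP} = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{|y_i|}\sum_{\lambda \in y_i}\frac{\left|\{\lambda' \in y_i : \mathrm{rank}(x_i,\lambda') \le \mathrm{rank}(x_i,\lambda)\}\right|}{\mathrm{rank}(x_i,\lambda)}$$

Here N is the number of all samples; y_i is the set of labels of sample x_i and |y_i| is its size; rank(x_i,λ) is the rank of label λ for sample x_i when the labels are sorted by decreasing prediction value (sometimes viewed as probability). We use AP as the primary measure in our comparative experiments.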
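For concreteness, the following Python sketch computes average precision in the form commonly used in the multi-label learning literature, which is assumed here to be the measure meant by the paper. The function name `average_precision`, the handling of samples without labels and the toy data are assumptions.

```python
def average_precision(scores, relevant):
    """Multi-label average precision over a set of samples.

    `scores` holds one list of per-label prediction values per sample;
    `relevant` holds the set of true label indices for each sample.
    """
    total = 0.0
    counted = 0
    for sample_scores, true_labels in zip(scores, relevant):
        if not true_labels:
            continue  # assumption: samples without any true label are skipped
        # Rank 1 is given to the label with the highest prediction value.
        order = sorted(range(len(sample_scores)),
                       key=lambda j: -sample_scores[j])
        rank = {label: position + 1 for position, label in enumerate(order)}
        per_sample = 0.0
        for label in true_labels:
            # True labels ranked at or above this label.
            hits = sum(1 for other in true_labels if rank[other] <= rank[label])
            per_sample += hits / rank[label]
        total += per_sample / len(true_labels)
        counted += 1
    return total / counted if counted else 0.0


# Two toy samples with three labels each.
toy_scores = [[0.9, 0.1, 0.8], [0.2, 0.9, 0.5]]
toy_labels = [{0, 2}, {0}]
print(round(average_precision(toy_scores, toy_labels), 4))  # 0.6667
```

In the toy data the first sample ranks both true labels first (contributing 1.0) and the second ranks its only true label last (contributing 1/3), so the average is 2/3.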

Results and Discussion

Contrast experiments based on 188-dimensional classical features

Experiment (1): Seven types of multi-labeled base classifiers are evaluated with fivefold cross validation on the 188-dimensional feature^18,47 training set. Detailed values are given in Table S1 in the supplementary materials. We take AP as the main reference indicator, and the AP values of the seven base classifiers are shown in Fig. 2. The seven commonly used base classifiers in the experiment are random forest (RF), decision tree (J48), k nearest neighbor (IBK), logistic regression for multi-label classification (IBLR_ML)⁴⁸, k nearest neighbor for multi-label classification (MLkNN)⁴⁹, lazy multi-label classification (BRkNN)⁵⁰ and hierarchy of multi-label learners (HOMER)⁵¹. The first three classifiers are single-label ones, while the latter four are multi-label classifiers.

IBLR_ML achieves the highest AP value of the cross validation (59.37%), whereas HOMER has the lowest value (34.88%). The AP values of RF and IBK are less than 50%. We abandon the above three base classifiers with lower AP values. The four basic classifiers with higher AP values, namely, J48, IBLR_ML, MLkNN and BRkNN, are integrated to the classification algorithm in Experiment (2).

Experiment (2): The four basic classifiers retained in Experiment (1) are integrated using our multi-label ensemble classification algorithms. We provide a fivefold cross validation for training sets. The AP values are shown in Fig. 3. Figure 3 demonstrates that the integration effect of MeanEnsemble multi-label ensemble classification algorithm for four types of base classifiers in Experiment (1) is optimal. The AP value is 61.70%.

The results of Experiments (1) and (2) show that the ensemble classification algorithm plays a significant role in improving the accuracy of protein subcellular localization. It should be noted that this is a seriously imbalanced classification problem, so the classifiers tend to favor the dominating labels. Table S4 shows the detailed performance on individual subcellular locations. In previous works, all the small classes were combined into one big class. We are the first to try to categorize 37 subcellular structures for prediction. Compared with previous works, we cover more subcellular structures and obtain higher average accuracy.

Contrast experiments based on PSSM-20-dimensional feature

Experiment (3): Seven types of multi-labeled base classifiers are evaluated with fivefold cross validation on the PSSM-20-dimensional feature training set. Classification performance is shown in Table S2 in the supplementary materials. Based on Table S2, we conclude that the fivefold cross validation AP values obtained with PSSM-20d are higher, indicating better classification results. We still take AP as the main reference indicator, and the AP values of the seven base classifiers are shown in Fig. 4.

The chart shows that the IBLR_ML classifier obtains the highest AP value (62.01%), an improvement of 2.64 percentage points over the validation result on the 188-dimensional feature training set. The other base classifiers also improve to different degrees compared with Experiment (1). The four base classifiers with higher AP values, namely, J48, IBLR_ML, MLkNN and BRkNN, are integrated into the classification algorithm in Experiment (4).

Experiment (4): We provide a fivefold cross validation for the training set with the same method as that in Experiment (2). The AP values are shown in Fig. 5.

The MeanEnsemble multi-label ensemble classification algorithm is still the best and better than the cross validation results of Experiment (2). The AP value reached 64.27%. TopKEnsemble and MajorityVoteEnsemble algorithms exhibit a larger increase compared with the training results in Experiment (2), but still less than the integrated effect of MeanEnsemble.

The results of Experiments (3) and (4) again show that the ensemble classification algorithm plays a significant role in improving the accuracy of protein subcellular localization.

Contrast experiments based on PseAAC-420-dimensional feature

Experiment (5): Seven types of multi-labeled base classifiers are evaluated with fivefold cross validation on the PseAAC-420-dimensional feature⁴² training set. Classification performance is shown in Table S3 in the supplementary materials. From Table S3 we can see that the fivefold cross validation AP values obtained with PseAAC-420d decline compared with 188d. The AP value of IBLR_ML is 56.36%, which is still the highest among the base classifiers. It is 3.01 and 5.65 percentage points lower than in Experiments (1) and (3), respectively. The cross validation results are shown in Fig. 6.
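The differences quoted above follow directly from the IBLR_ML AP values reported in Experiments (1), (3) and (5), as the short Python check below shows. The dictionary keys and the function name `ap_drop` are labels chosen for this illustration.

```python
# AP values (%) of IBLR_ML reported in Experiments (1), (3) and (5).
iblr_ml_ap = {"188D": 59.37, "PSSM-20D": 62.01, "PseAAC-420D": 56.36}


def ap_drop(reference, target, table=iblr_ml_ap):
    """Difference in percentage points between two feature sets."""
    return round(table[reference] - table[target], 2)


print(ap_drop("188D", "PseAAC-420D"))      # decline relative to Experiment (1)
print(ap_drop("PSSM-20D", "PseAAC-420D"))  # decline relative to Experiment (3)
```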

The chart shows that the cross validation results of PseAAC-420-dimensional feature training set are the worst. The training results of the seven types of base classifiers decline compared with Experiments (1) and (3).

Experiment (6): We provide a fivefold cross validation for the training set with the same method as that in Experiment (4). The AP values are shown in Fig. 7.

Comparison with state-of-the-art methods

To assess the performance of our method, we compared it with recent protein subcellular localization web servers, including IMMMLGP²⁸, Hum-mPLoc 2.0⁴⁰ and mGOF-Loc⁵². The first is a multi-label classifier, while the other two can predict only a single class. We therefore employ D_RM for multi-label classification and D_RS for single-label classification. Because both multi-label and single-label classifiers are involved, we cannot compare them with multi-label measurements such as Macro-averaged Precision, Micro-averaged Precision, Macro-averaged F-Measure and Micro-averaged F-Measure. We compare only the average accuracy on the testing data set. Table 4 shows the accuracy comparison, from which we can see that our method outperforms the other methods. All of the accuracy rates come from 10-fold cross validation.

Table 4 Accuracy comparison with state-of-the-art methods.

Full size table

In addition, we tested our methods on other species, including plant, virus, eukaryote and animal. The related data sets and performance are shown in Tables S5 and S6 in the supplementary materials. We conclude that our methods also work on other species, although the performance was poorer than on the human data set in every case. This is because our integrated human protein subcellular localization data set is more complete than those of the other species. We will continue to collect protein subcellular localization data for other species in the future.

Experiments analysis and discussion

We compare and analyze the training results of Experiments (1), (3) and (5) and Experiments (2), (4) and (6).

First, the seven cross validation results that correspond to the PSSM-20-dimensional feature training set are better than those of the other two feature extraction algorithms. The IBLR_ML-based classifier shows the best performance, with the highest AP value of 62.01%. The contrast experiments show that the PSSM-20-dimensional feature training set gives the best cross validation results for the base classifiers.

Second, the cross validation results of MeanEnsemble, TopKEnsemble and MajorityVoteEnsemble on PSSM-20-dimensional feature training set are higher than those of 188d and PseAAC-420d. The advantages of PSSM-20d in multi-label ensemble classification are shown.

By comparing the experimental results of the two groups, we conclude that the 20-dimensional feature extraction algorithm based on the PSSM is the most effective for protein subcellular localization.

Then we compare and analyze the training results of Experiments (3) and (4). Among the ensemble algorithms, MeanEnsemble performs best, with an AP value of 64.27%, which is higher than the AP of any single base classifier. MajorityVoteEnsemble performs worst, with a fivefold cross validation AP value of only 60.23%. This value is lower than the multi-label classification results of the base classifiers IBLR_ML, BRkNN and MLkNN on the same data set, so it does not show the advantage of the ensemble approach, and it is also time consuming. By comparing the experimental results, we conclude that the multi-label classifier ensemble algorithm MeanEnsemble achieves the best effect on the PSSM-20-dimensional feature training set. Among the four integrated base classifiers, IBLR_ML shows the best multi-label learning performance.

Conclusion

Protein subcellular localization with computational methods is a multi-label classification problem. State-of-the-art prediction methods employ traditional single-label machine learning. We proposed novel multi-label ensemble classification techniques with novel hybrid protein features. Experiments demonstrated the effectiveness of our features and the ensemble strategy. Several recent works have shown that ensemble learning⁵³ and feature reduction⁵⁴ can improve the performance on weak learning problems. However, the present work employed the simplest voting strategies and did not apply any feature reduction techniques. Moreover, class imbalance occurs in protein subcellular localization problems. Imbalance learning for binary classification has been developed and applied in bioinformatics research^55,56. However, no imbalance learning techniques exist for multi-class and multi-label classification. All these problems, as well as the application to large data⁵⁷, will be investigated in future work.

Additional Information

How to cite this article: Guo, X. et al. Human Protein Subcellular Localization with Integrated Source and Multi-label Ensemble Classifier. Sci. Rep. 6, 28087; doi: 10.1038/srep28087 (2016).

References

1. LaQuaglia, M. J. et al. YAP protein expression and subcellular localization in pediatric liver tumors. CANCER RES 75, 2107–2107 (2015).
2. Huh, W.-K. et al. Global analysis of protein localization in budding yeast. NATURE 425, 686–691 (2003).
3. Maliepaard, M. et al. Subcellular localization and distribution of the breast cancer resistance protein transporter in normal human tissues. CANCER RES 61, 3458–3464 (2001).
4. Camp, R. L., Chung, G. G. & Rimm, D. L. Automated subcellular localization and quantification of protein expression in tissue microarrays. NAT MED 8, 1323–1328 (2002).
5. Gardy, J. L. & Brinkman, F. S. Methods for predicting bacterial protein subcellular localization. NAT REV MICROBIOL 4, 741–751 (2006).
6. Wang, Z., Zou, Q., Jiang, Y., Ju, Y. & Zeng, X. Review of protein subcellular localization prediction. CURR BIOINFORM 9, 331–342 (2014).
7. Sprenger, J. et al. LOCATE: a mammalian protein subcellular localization database. NUCLEIC ACIDS RES 36, D230–D233 (2008).
8. Rey, S. et al. PSORTdb: a protein subcellular localization database for bacteria. NUCLEIC ACIDS RES 33, D164–D168 (2005).
9. Li, S., Ehrhardt, D. W. & Rhee, S. Y. Systematic analysis of Arabidopsis organelles and a protein localization database for facilitating fluorescent tagging of full-length Arabidopsis proteins. PLANT PHYSIOL 141, 527–539 (2006).
10. Kumar, A. et al. Subcellular localization of the yeast proteome. GENE DEV 16, 707–719 (2002).
11. Chou, K. C. & Shen, H. B. Large‐scale plant protein subcellular location prediction. J CELL BIOCHEM 100, 665–678 (2007).
12. Nair, R. & Rost, B. LOCnet and LOCtarget: sub-cellular localization for structural genomics targets. NUCLEIC ACIDS RES 32, W517–W521 (2004).
13. Nair, R. & Rost, B. LOC3D: annotate sub-cellular localization for protein structures. NUCLEIC ACIDS RES 31, 3337–3340 (2003).
14. Guo, T., Hua, S., Ji, X. & Sun, Z. DBSubLoc: database of protein subcellular localization. NUCLEIC ACIDS RES 32, D122–D124 (2004).
15. Lu, P. et al. PA-GOSUB: a searchable database of model organism protein sequences with their predicted Gene Ontology molecular function and subcellular localization. NUCLEIC ACIDS RES 33, D147–D153 (2005).
16. Du, P., Wang, X., Xu, C. & Gao, Y. PseAAC-Builder: A cross-platform stand-alone program for generating various special Chou’s pseudo-amino acid compositions. ANAL BIOCHEM 425, 117–119 (2012).
17. Liu, B. et al. Identification of microRNA precursor with the degenerate K-tuple or Kmer strategy. J THEOR BIOL 385, 153–159 (2015).
18. Cai, C., Han, L., Ji, Z. L., Chen, X. & Chen, Y. Z. SVM-Prot: web-based support vector machine software for functional classification of a protein from its primary sequence. NUCLEIC ACIDS RES 31, 3692–3697 (2003).
19. Wei, L., Liao, M., Gao, X. & Zou, Q. An Improved Protein Structural Prediction Method by Incorporating Both Sequence and Structure Information. IEEE T NANOBIOSCI 14, 339–349 (2015).
20. Wei, L., Liao, M., Gao, X. & Zou, Q. Enhanced Protein Fold Prediction Method through a Novel Feature Extraction Technique. IEEE T NANOBIOSCI 14, 649–659 (2015).
21. Xu, R. et al. Identifying DNA-binding proteins by combining support vector machine and PSSM distance transformation. BMC SYST BIOL 9, S10 (2015).
22. Liu, B. et al. Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA and protein sequences. NUCLEIC ACIDS RES 43, W65–W71 (2015).
23. Xiao, N., Cao, D. S., Zhu, M. F. & Xu, Q. S. protr/ProtrWeb: R package and web server for generating various numerical representation schemes of protein sequences. BIOINFORM 31, 1857–1859 (2015).
24. Shen, H.-B. & Chou, K.-C. PseAAC: a flexible web server for generating various kinds of protein pseudo amino acid composition. ANAL BIOCHEM 373, 386–388 (2008).
25. Zhao, X., Zou, Q., Liu, B. & Liu, X. Exploratory predicting protein folding model with random forest and hybrid features. CURR PROTEOMICS 11, 289–299 (2014).
26. Zou, Q. et al. Improving tRNAscan-SE annotation results via ensemble classifiers. MOL INFORM 34, 761–770 (2015).
27. Wang, C., Hu, L., Guo, M., Liu, X. & Zou, Q. imDC: an ensemble learning method for imbalanced classification with miRNA data. GENET MOL RES 14, 123–133 (2015).
28. He, J., Gu, H. & Liu, W. Imbalanced multi-modal multi-label learning for subcellular localization prediction of human proteins with both single and multiple sites. PLOS ONE 7, e37155 (2012).
29. Mei, S. Multi-label multi-kernel transfer learning for human protein subcellular localization. PLOS ONE 7, e37716 (2012).
30. Wu, Z.-C., Xiao, X. & Chou, K.-C. iLoc-Plant: a multi-label classifier for predicting the subcellular localization of plant proteins with both single and multiple sites. MOL BIOSYST 7, 3287–3297 (2011).
31. Xiao, X., Wu, Z.-C. & Chou, K.-C. iLoc-Virus: A multi-label learning classifier for identifying the subcellular localization of virus proteins with both single and multiple sites. J THEOR BIOL 284, 42–51 (2011).
32. Wang, X., Li, G.-Z. & Lu, W.-C. Virus-ECC-mPLoc: a multi-label predictor for predicting the subcellular localization of virus proteins with both single and multiple sites based on a general form of Chou’s pseudo amino acid composition. PROTEIN PEPTIDE LETT 20, 309–317 (2013).
33. Chou, K.-C., Wu, Z.-C. & Xiao, X. iLoc-Euk: a multi-label classifier for predicting the subcellular localization of singleplex and multiplex eukaryotic proteins. PLOS ONE 6, e18258 (2011).
34. Wang, X. & Li, G.-Z. A multi-label predictor for identifying the subcellular locations of singleplex and multiplex eukaryotic proteins. PLOS ONE 7, e36317 (2012).
35. Lin, W.-Z., Fang, J.-A., Xiao, X. & Chou, K.-C. iLoc-Animal: a multi-label learning classifier for predicting subcellular localization of animal proteins. MOL BIOSYST 9, 634–644 (2013).
36. Xiao, X., Wu, Z.-C. & Chou, K.-C. A multi-label classifier for predicting the subcellular localization of gram-negative bacterial proteins with both single and multiple sites. PLOS ONE 6, e20592 (2011).
37. Mei, S. Predicting plant protein subcellular multi-localization by Chou’s PseAAC formulation based multi-label homolog knowledge transfer learning. J THEOR BIOL 310, 80–87 (2012).
38. Wan, S., Mak, M.-W. & Kung, S.-Y. mGOASVM: Multi-label protein subcellular localization based on gene ontology and support vector machines. BMC BIOINFORM 13, 1 (2012).
39. Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. BIOINFORM 28, 3150–3152 (2012).
40. Shen, H.-B. & Chou, K.-C. A top-down approach to enhance the power of predicting human protein subcellular localization: Hum-mPLoc 2.0. ANAL BIOCHEM 394, 269–274 (2009).
41. Chou, K.-C. & Shen, H.-B. MemType-2L: a web server for predicting membrane proteins and their types by incorporating evolution information through Pse-PSSM. BIOCHEM BIOPH RES CO 360, 339–345 (2007).
42. Chou, K. C. Prediction of protein cellular attributes using pseudo‐amino acid composition. PROTEIN: STRUC, FUNC, & BIOINFORM 43, 246–255 (2001).
43. Zou, Q., Hu, Q., Guo, M. & Wang, G. HAlign: Fast multiple similar DNA/RNA sequence alignment based on the centre star strategy. BIOINFORM 31, 2475–2481 (2015).
44. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J MOL BIOL 215, 403–410 (1990).
45. Tsoumakas, G., Spyromitros-Xioufis, E., Vilcek, J. & Vlahavas, I. MULAN: A Java library for multi-label learning. J MACH LEARN RES 12, 2411–2414 (2011).
46. Zhou, Z.-H., Zhang, M.-L., Huang, S.-J. & Li, Y.-F. Multi-instance multi-label learning. ARTIF INTELL 176, 2291–2320 (2012).
47. Lin, C. et al. Hierarchical classification of protein folds using a novel ensemble classifier. PLOS ONE 8, e56499 (2013).
48. Cheng, W. & Hüllermeier, E. Combining instance-based learning and logistic regression for multilabel classification. MACH LEARN 76, 211–225 (2009).
49. Zhang, M.-L. & Zhou, Z.-H. ML-KNN: A lazy learning approach to multi-label learning. PATTERN RECOGN 40, 2038–2048 (2007).
50. Spyromitros, E., Tsoumakas, G. & Vlahavas, I. An empirical study of lazy multilabel classification algorithms. In AI:THE, MOD & APP 401–406 (Springer, 2008).
Tsoumakas, G., Katakis, I. & Vlahavas, I. Effective and efficient multilabel classification in domains with large number of labels. In Proc. ECML/PKDD 2008 MMD’08. 30–44.
Wei, L., Liao, M., Gao, X., Wang, J. & Lin, W. mGOF-Loc: A Novel Ensemble Learning Method for Human Protein Subcellular Localization Prediction. (2016) Available at: http://server.malab.cn/mGOF-loc/Index.html (Accessed: 5th May 2016).
Lin, C. et al. LibD3C: ensemble classifiers with a clustering and dynamic selection strategy. NEUROCOMP 123, 424–435 (2014).
Google Scholar
Zou, Q., Zeng, J., Cao, L. & Ji, R. A novel features ranking metric with application to scalable visual and bioinformatics data classification. NEUROCOMP 173, 346–354 (2016).
Google Scholar
Song, L. et al. nDNA-prot: Identification of DNA-binding Proteins Based on Unbalanced Classification. BMC BIOINFORM 15, 298 (2014).
Google Scholar
Zou, Q., Xie, S., Lin, Z., Wu, M. & Ju, Y. Finding the best classification threshold in imbalanced classification. BIG DATA RES, doi: 10.1016/j.bdr.2015.12.001 (2016).
Zou, Q. et al. Survey of MapReduce Frame Operation in Bioinformatics. BRIEF BIOINFORM 15, 637–647 (2014).
PubMed Google Scholar

Download references

Acknowledgements

This work was supported by the Natural Science Foundation of China (No. 61402132).

Author information

Authors and Affiliations

School of Instrumentation Science and Opto-electronics Engineering, Beihang University, Beijing, China
Xiaotong Guo & Fulin Liu
School of Information Science and Technology, Xiamen University, Xiamen, China
Ying Ju & Zhen Wang
School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
Chunyu Wang

Contributions

X.T.G. drafted the initial manuscript and did most of the coding work. F.L.L. helped to collect the protein localization data. Y.J. helped to revise the English. Z.W. participated in the design of the experiments. C.Y.W. guided the whole work and helped to draft the manuscript. All authors read and approved the final manuscript.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Electronic supplementary material

Supplementary Information

Rights and permissions

This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/

About this article

Cite this article

Guo, X., Liu, F., Ju, Y. et al. Human Protein Subcellular Localization with Integrated Source and Multi-label Ensemble Classifier. Sci Rep 6, 28087 (2016). https://doi.org/10.1038/srep28087

Received: 24 February 2016
Accepted: 26 May 2016
Published: 21 June 2016
DOI: https://doi.org/10.1038/srep28087
Springer Nature Limited
