Abstract
The discovery of novel cancer genes is one of the main goals of cancer research. Bioinformatics methods can accelerate cancer gene discovery, which may aid both the understanding of cancer and the development of drug targets. In this paper, we describe a classifier for predicting potential cancer genes that integrates multiple lines of biological evidence, including protein-protein interaction network properties, sequence features and functional features. We detected 55 features that differed significantly between cancer genes and non-cancer genes. Fourteen cancer-associated features were chosen to train the classifier. Four machine learning methods (logistic regression, support vector machines (SVMs), BayesNet and the J48 decision tree) were explored as classifier models to distinguish cancer genes from non-cancer genes. The predictive power of the different models was evaluated by 5-fold cross-validation. The area under the receiver operating characteristic curve for the logistic regression, SVM, BayesNet and J48 decision tree models was 0.834, 0.740, 0.800 and 0.782, respectively. Finally, the logistic regression classifier with multiple biological features was applied to the genes in the Entrez database, and 1976 cancer gene candidates were identified. We found that the integrated prediction model performed much better than models based on individual types of biological evidence, and that the network and functional features had stronger predictive power than the sequence features in predicting cancer genes.
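The evaluation described above (train a classifier, score held-out genes in 5-fold cross-validation, summarize with the area under the ROC curve) can be illustrated with a minimal sketch. This is not the authors' pipeline: the data are synthetic, the three features and the class sizes are hypothetical, the logistic regression is a plain gradient-descent fit, and pooling the out-of-fold scores before computing one AUC is only one of several reasonable conventions.

```python
import math
import random


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, epochs=200, lr=0.1):
    """Fit logistic regression weights and bias by batch gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def predict(w, b, X):
    """Return the predicted probability of the positive class for each row."""
    return [sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) for xi in X]


def cross_validated_auc(X, y, k=5, seed=0):
    """Score every sample with a model that never saw it, then compute one pooled AUC."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = [0.0] * len(X)
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        w, b = train_logistic([X[i] for i in train], [y[i] for i in train])
        for i, p in zip(fold, predict(w, b, [X[i] for i in fold])):
            scores[i] = p
    return auc(scores, y)


# Synthetic stand-in data (assumption): 60 "cancer" and 140 "non-cancer" genes,
# each with 3 hypothetical features whose mean is shifted for the positive class.
rng = random.Random(42)
y = [1] * 60 + [0] * 140
X = [[rng.gauss(1.0 if yi else 0.0, 1.0) for _ in range(3)] for yi in y]

cv_auc = cross_validated_auc(X, y, k=5)
print(f"5-fold cross-validated AUC: {cv_auc:.3f}")
```

Comparing several models, as the abstract does, would amount to swapping `train_logistic` and `predict` for another learner while keeping the same folds, so that the reported AUC values are computed on identical held-out genes.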
Additional information
This article is published with open access at Springerlink.com
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Liu, W., Xie, H. Predicting potential cancer genes by integrating network properties, sequence features and functional annotations. Sci. China Life Sci. 56, 751–757 (2013). https://doi.org/10.1007/s11427-013-4500-6
DOI: https://doi.org/10.1007/s11427-013-4500-6