Accurate Cancer Classification Using Expressions of Very Few Genes

IEEE/ACM Transactions on Computational Biology and Bioinformatics, Volume 4, Issue 1, pp. 40–53. https://doi.org/10.1109/TCBB.2007.1006

Published: 01 January 2007

Abstract

We aim to find the smallest set of genes that can ensure highly accurate classification of cancers from microarray data by using supervised machine learning algorithms. The significance of finding the minimum gene subsets is threefold: 1) It greatly reduces the computational burden and the "noise" arising from irrelevant genes. In the examples studied in this paper, finding the minimum gene subsets even allows for the extraction of simple diagnostic rules that lead to accurate diagnosis without the need for any classifiers. 2) It simplifies gene expression tests to include only a very small number of genes rather than thousands of genes, which can significantly bring down the cost of cancer testing. 3) It calls for further investigation into the possible biological relationship between these small numbers of genes and cancer development and treatment. Our simple yet very effective method involves two steps. In the first step, we choose some important genes using a feature importance ranking scheme. In the second step, we test the classification capability of all simple combinations of those important genes by using a good classifier. For three "small" and "simple" data sets with two, three, and four cancer (sub)types, our approach obtained very high accuracy with only two or three genes. For a "large" and "complex" data set with 14 cancer types, we divided the whole problem into a group of binary classification problems and applied the two-step approach to each of these binary classification problems. Through this "divide-and-conquer" approach, we obtained accuracy comparable to previously reported results but with only 28 genes rather than 16,063 genes. In general, our method can significantly reduce the number of genes required for highly reliable diagnosis.
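The two-step procedure described in the abstract can be illustrated with a minimal sketch. The abstract does not name the ranking scheme or the classifier, so everything concrete here is an assumption made for illustration: a Welch-style t-statistic stands in for the feature importance ranking, a nearest-centroid classifier with leave-one-out evaluation stands in for the "good classifier", and the function names and the toy expression matrix are hypothetical. For brevity the ranking is computed once on all samples; a careful evaluation would repeat it inside each held-out fold to avoid selection bias.

```python
from itertools import combinations
from statistics import mean, variance


def t_score(values, labels):
    """Absolute Welch-style t-statistic of one gene between classes 0 and 1 (assumed ranking)."""
    a = [v for v, y in zip(values, labels) if y == 0]
    b = [v for v, y in zip(values, labels) if y == 1]
    denom = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return abs(mean(a) - mean(b)) / denom if denom > 0 else 0.0


def rank_genes(X, labels, top_k):
    """Step 1: indices of the top_k genes, highest score first."""
    n_genes = len(X[0])
    scores = [t_score([row[g] for row in X], labels) for g in range(n_genes)]
    return sorted(range(n_genes), key=lambda g: -scores[g])[:top_k]


def loo_accuracy(X, labels, genes):
    """Leave-one-out accuracy of a nearest-centroid classifier restricted to `genes`."""
    correct = 0
    for i, (row, y) in enumerate(zip(X, labels)):
        # Build class centroids from every sample except the held-out one.
        centroids = {}
        for c in set(labels):
            members = [r for j, r in enumerate(X) if labels[j] == c and j != i]
            centroids[c] = [mean(r[g] for r in members) for g in genes]

        def sq_dist(c):
            return sum((row[g] - m) ** 2 for g, m in zip(genes, centroids[c]))

        correct += min(centroids, key=sq_dist) == y
    return correct / len(X)


def smallest_accurate_subset(X, labels, top_k=3, max_size=2, target=1.0):
    """Step 2: test all combinations of the top genes, smallest subsets first."""
    top = rank_genes(X, labels, top_k)
    best = (0.0, ())
    for size in range(1, max_size + 1):
        for combo in combinations(top, size):
            acc = loo_accuracy(X, labels, combo)
            if acc > best[0]:
                best = (acc, combo)
        if best[0] >= target:
            break  # a smaller subset already reaches the target accuracy
    return best


# Hypothetical toy data: 8 samples x 4 genes. Genes 1 and 2 separate the two
# classes only when used together; genes 0 and 3 carry no class information.
X = [
    [5.0, 1.0, 3.0, 2.0], [6.0, 2.0, 4.0, 1.0], [5.0, 3.0, 5.0, 2.0], [6.0, 4.0, 6.0, 1.0],
    [5.0, 3.0, 1.0, 1.0], [6.0, 4.0, 2.0, 2.0], [5.0, 5.0, 3.0, 1.0], [6.0, 6.0, 4.0, 2.0],
]
y = [0, 0, 0, 0, 1, 1, 1, 1]

accuracy, genes = smallest_accurate_subset(X, y)
print(accuracy, genes)
```

On this toy matrix no single gene classifies every held-out sample correctly, while one pair of genes does, which is the situation the exhaustive second step is designed to detect. The search stays cheap only because step 1 shrinks the candidate pool first; the number of combinations grows quickly with `top_k` and `max_size`.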

Index Terms

1. Applied computing
  1. Life and medical sciences
2. Computing methodologies
  1. Machine learning

Published in

IEEE/ACM Transactions on Computational Biology and Bioinformatics, Volume 4, Issue 1 (January 2007), 160 pages
ISSN: 1545-5963
Publisher: IEEE Computer Society Press, Washington, DC, United States

Author Tags

Cancer classification
fuzzy
gene expression
neural networks
support vector machines