ABSTRACT
We aim to find the smallest set of genes that can ensure highly accurate classification of cancers from microarray data using supervised machine learning algorithms. The significance of finding the minimum gene subsets is threefold: 1) It greatly reduces the computational burden and the "noise" arising from irrelevant genes. In the examples studied in this paper, finding the minimum gene subsets even allows simple diagnostic rules to be extracted that lead to accurate diagnosis without the need for any classifier. 2) It simplifies gene expression tests to include only a very small number of genes rather than thousands, which can significantly reduce the cost of cancer testing. 3) It calls for further investigation into the possible biological relationship between these small numbers of genes and cancer development and treatment. Our simple yet very effective method involves two steps. In the first step, we choose some important genes using a feature importance ranking scheme. In the second step, we test the classification capability of all simple combinations of those important genes using a good classifier. For three "small" and "simple" data sets with two, three, and four cancer (sub)types, our approach obtained very high accuracy with only two or three genes. For a "large" and "complex" data set with 14 cancer types, we divided the whole problem into a group of binary classification problems and applied the two-step approach to each of them. Through this "divide-and-conquer" approach, we obtained accuracy comparable to previously reported results, but with only 28 genes rather than 16,063. In general, our method can significantly reduce the number of genes required for highly reliable diagnosis.
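The two-step procedure described above can be illustrated with a minimal sketch. The abstract does not name the ranking scheme or the classifier, so everything concrete here is an assumption chosen for illustration: the between-class to within-class variance ratio used as the importance score, the nearest-centroid classifier evaluated by leave-one-out cross-validation, and the hypothetical names `rank_genes`, `loo_accuracy`, `smallest_gene_subset`, `top_k`, `max_size`, and `target`.

```python
from itertools import combinations
from statistics import mean


def rank_genes(X, y):
    """Step 1 (assumed score): rank gene indices by the ratio of
    between-class to within-class sum of squares, best first."""
    classes = sorted(set(y))
    scores = []
    for g in range(len(X[0])):
        col = [row[g] for row in X]
        overall = mean(col)
        between = within = 0.0
        for c in classes:
            vals = [v for v, label in zip(col, y) if label == c]
            m = mean(vals)
            between += len(vals) * (m - overall) ** 2
            within += sum((v - m) ** 2 for v in vals)
        scores.append(between / (within + 1e-12))  # guard against zero variance
    return sorted(range(len(scores)), key=lambda g: -scores[g])


def loo_accuracy(X, y, genes):
    """Leave-one-out accuracy of a nearest-centroid classifier
    (an assumed stand-in for "a good classifier") on the given genes."""
    correct = 0
    for i in range(len(X)):
        centroids = {}
        for c in set(y):
            rows = [X[j] for j in range(len(X)) if j != i and y[j] == c]
            if rows:
                centroids[c] = [mean(r[g] for r in rows) for g in genes]
        point = [X[i][g] for g in genes]
        pred = min(
            centroids,
            key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
        )
        correct += pred == y[i]
    return correct / len(X)


def smallest_gene_subset(X, y, top_k=10, max_size=3, target=1.0):
    """Step 2: try all 1-, 2-, ... gene combinations of the top-ranked genes
    and stop at the smallest size that reaches the target accuracy."""
    top = rank_genes(X, y)[:top_k]
    best = (0.0, ())
    for size in range(1, max_size + 1):
        for combo in combinations(top, size):
            acc = loo_accuracy(X, y, combo)
            if acc > best[0]:
                best = (acc, combo)
        if best[0] >= target:
            break
    return best
```

With samples as rows of `X` and class labels in `y`, `smallest_gene_subset(X, y)` returns an `(accuracy, gene_indices)` pair. For a multi-class problem, the divide-and-conquer variant would call the same function once per binary subproblem. Note that this sketch ranks genes on all samples before cross-validating, which can make the accuracy estimate optimistic; an independent test set would be needed for an unbiased figure.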