Data-Dependent Kernel Machines for Microarray Data Classification

Published: 01 October 2007

Abstract

One important application of gene expression analysis is the classification of tissue samples according to their gene expression levels. Gene expression data are typically characterized by high dimensionality and small sample size, which makes classification challenging. In this paper, we present a data-dependent kernel for microarray data classification. The kernel function is constructed to maximize the class separability of the training data, and a bootstrapping-based resampling scheme is introduced to reduce possible training bias. The effectiveness of this adaptive kernel is illustrated with a k-Nearest Neighbor (KNN) classifier. Our experimental study shows that the data-dependent kernel significantly improves the accuracy of KNN classifiers. Furthermore, the kernel-based KNN scheme is shown to be competitive with, if not better than, more sophisticated classifiers such as Support Vector Machines (SVMs) and Uncorrelated Linear Discriminant Analysis (ULDA) for classifying gene expression data.
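
To make the idea of a kernel-based KNN classifier concrete, the following is a minimal sketch and not the paper's implementation. The abstract does not give the form of the data-dependent kernel, so the sketch assumes a conformal rescaling of a Gaussian base kernel, k(x, y) = q(x) q(y) k0(x, y), with hand-picked coefficients. The names `rbf_kernel`, `conformal_factor`, `data_dependent_kernel`, and `kernel_knn_predict`, the parameters `alpha`, `gamma0`, and `gamma1`, and the synthetic data are all hypothetical. The optimization of the coefficients for class separability and the bootstrapping-based resampling are omitted.

```python
import numpy as np


def rbf_kernel(X, Y, gamma):
    """Gaussian base kernel k0(x, y) = exp(-gamma * ||x - y||^2)."""
    sq = (X ** 2).sum(1)[:, None] - 2.0 * X @ Y.T + (Y ** 2).sum(1)[None, :]
    return np.exp(-gamma * np.maximum(sq, 0.0))


def conformal_factor(X, anchors, alpha, gamma1):
    """Assumed factor q(x) = alpha[0] + sum_i alpha[i] * exp(-gamma1 * ||x - a_i||^2)."""
    return alpha[0] + rbf_kernel(X, anchors, gamma1) @ alpha[1:]


def data_dependent_kernel(X, Y, anchors, alpha, gamma0, gamma1):
    """Assumed data-dependent kernel k(x, y) = q(x) * q(y) * k0(x, y)."""
    qx = conformal_factor(X, anchors, alpha, gamma1)
    qy = conformal_factor(Y, anchors, alpha, gamma1)
    return qx[:, None] * qy[None, :] * rbf_kernel(X, Y, gamma0)


def kernel_knn_predict(X_train, y_train, X_test, kernel, k=3):
    """KNN with the feature-space distance induced by `kernel`:
    d^2(x, z) = k(x, x) - 2 * k(x, z) + k(z, z).
    Labels in y_train must be non-negative integers.
    """
    k_cross = kernel(X_test, X_train)
    k_test = np.diag(kernel(X_test, X_test))
    k_train = np.diag(kernel(X_train, X_train))
    d2 = k_test[:, None] - 2.0 * k_cross + k_train[None, :]
    nearest = np.argsort(d2, axis=1)[:, :k]
    # Majority vote among the k nearest training samples.
    return np.array([np.bincount(row).argmax() for row in y_train[nearest]])


# Synthetic stand-in for microarray data: few samples, many "genes".
rng = np.random.default_rng(0)
n_per_class, n_genes = 10, 200


def make_split():
    X0 = rng.normal(0.0, 1.0, (n_per_class, n_genes))
    X1 = rng.normal(0.0, 1.0, (n_per_class, n_genes))
    X1[:, :20] += 3.0  # only the first 20 features carry class signal
    X = np.vstack([X0, X1])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return X, y


X_train, y_train = make_split()
X_test, y_test = make_split()

gamma = 1.0 / (2.0 * n_genes)
# Hand-picked coefficients; a separability-maximizing method would learn these.
alpha = np.concatenate([[1.0], np.full(len(X_train), 0.05)])


def kernel(A, B):
    return data_dependent_kernel(A, B, X_train, alpha, gamma, gamma)


pred = kernel_knn_predict(X_train, y_train, X_test, kernel, k=3)
print("test accuracy:", (pred == y_test).mean())
```

With `alpha` set to `[1, 0, ..., 0]` the factor q(x) is constant and the classifier reduces to ordinary KNN in the Gaussian kernel's feature space, which makes a convenient baseline for comparing against any learned coefficients.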

Published In

IEEE/ACM Transactions on Computational Biology and Bioinformatics, Volume 4, Issue 4
October 2007
192 pages

Publisher

IEEE Computer Society Press

Washington, DC, United States

Publication History

Published: 01 October 2007
Published in TCBB Volume 4, Issue 4

Author Tags

  1. Microarray data analysis
  2. bootstrapping resampling
  3. cancer classification
  4. kernel machines
  5. kernel optimization

Qualifiers

  • Article

Cited By

  • (2019) Improvement of the target selection process in transcriptomics data. Proceedings of the New Challenges in Data Sciences: Acts of the Second Conference of the Moroccan Classification Society, pp. 1-4. DOI: 10.1145/3314074.3314090. Online publication date: 28-Mar-2019
  • (2019) Lowest probability mass neighbour algorithms. Machine Learning, 108:2, pp. 331-376. DOI: 10.1007/s10994-018-5737-x. Online publication date: 1-Feb-2019
  • (2018) Data-dependent kernel sparsity preserving projection and its application for semi-supervised classification. Multimedia Tools and Applications, 77:18, pp. 24459-24475. DOI: 10.1007/s11042-018-5707-0. Online publication date: 1-Sep-2018
  • (2018) Supervised data-dependent kernel sparsity preserving projection for image recognition. Applied Intelligence, 48:12, pp. 4923-4936. DOI: 10.1007/s10489-018-1249-4. Online publication date: 1-Dec-2018
  • (2015) Smart Colonography for Distributed Medical Databases with Group Kernel Feature Analysis. ACM Transactions on Intelligent Systems and Technology, 6:4, pp. 1-24. DOI: 10.1145/2668136. Online publication date: 27-Jul-2015
  • (2015) Multiple data-dependent kernel for classification of hyperspectral images. Expert Systems with Applications: An International Journal, 42:3, pp. 1118-1135. DOI: 10.1016/j.eswa.2014.09.004. Online publication date: 15-Feb-2015
  • (2013) A similarity based learning framework for interim analysis of outcome prediction of acupuncture for neck pain. International Journal of Data Mining and Bioinformatics, 8:4, pp. 381-395. DOI: 10.1504/IJDMB.2013.056643. Online publication date: 1-Sep-2013
  • (2011) Cancer Classification from Gene Expression Data by NPPC Ensemble. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 8:3, pp. 659-671. DOI: 10.1109/TCBB.2010.36. Online publication date: 1-May-2011
  • (2010) Noise reduction of cDNA microarray images using complex wavelets. IEEE Transactions on Image Processing, 19:8, pp. 1953-1967. DOI: 10.1109/TIP.2010.2045691. Online publication date: 1-Aug-2010