Feature Selection for Gene Expression Using Model-Based Entropy

Published: 01 January 2010

Abstract

Gene expression data usually contain a large number of genes but a small number of samples. Feature selection for gene expression data aims to find a set of genes that best discriminates between biological samples of different types. Traditional machine learning approaches to gene selection based on empirical mutual information suffer from data sparseness because of the small number of samples. To overcome the sparseness issue, we propose a model-based approach that estimates the entropy of the class variables on a model fitted to the data, instead of on the data themselves. Here, we use multivariate normal distributions to fit the data, because multivariate normal distributions have maximum entropy among all real-valued distributions with a specified mean and covariance and are widely used to approximate various distributions. If the data follow a multivariate normal distribution, the conditional distribution of the class variables given the selected features is also normal, so its entropy can be computed from the log-determinant of its covariance matrix. Because of the large number of genes, computing all possible log-determinants is inefficient. We propose several algorithms that substantially reduce the computational cost. Experiments on seven gene data sets and comparisons with five other approaches show the accuracy of the multivariate Gaussian generative model for feature selection and the efficiency of our algorithms.
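
The log-determinant computation described in the abstract can be illustrated with a minimal sketch. It rests on two standard facts: for jointly Gaussian variables, the covariance of the class variables given the selected features is the Schur complement Sigma_YY - Sigma_YS Sigma_SS^{-1} Sigma_SY, and a k-dimensional Gaussian with covariance Sigma has differential entropy 0.5 * log((2*pi*e)^k * det(Sigma)). Everything else here is an assumption made for illustration: the function names, the encoding of class labels as real-valued columns of the joint covariance, and the naive greedy forward search. The abstract does not describe the authors' faster algorithms, and this sketch does not attempt to reproduce them.

```python
import numpy as np

def conditional_gaussian_entropy(cov, feat_idx, class_idx):
    """Differential entropy (in nats) of the class variables given the
    selected features, assuming all variables are jointly Gaussian with
    covariance matrix `cov`. (Hypothetical helper, not from the paper.)"""
    feat_idx, class_idx = list(feat_idx), list(class_idx)
    cond_cov = cov[np.ix_(class_idx, class_idx)]
    if feat_idx:
        s_ys = cov[np.ix_(class_idx, feat_idx)]
        s_ss = cov[np.ix_(feat_idx, feat_idx)]
        # Schur complement: covariance of the class variables given the features.
        cond_cov = cond_cov - s_ys @ np.linalg.solve(s_ss, s_ys.T)
    sign, logdet = np.linalg.slogdet(cond_cov)
    if sign <= 0:
        raise ValueError("conditional covariance is not positive definite")
    k = len(class_idx)
    return 0.5 * (k * np.log(2 * np.pi * np.e) + logdet)

def greedy_select(cov, candidate_idx, class_idx, n_select):
    """Naive forward selection (an assumed search strategy): at each step,
    add the feature that most reduces the conditional entropy."""
    selected, remaining = [], list(candidate_idx)
    for _ in range(n_select):
        best = min(
            remaining,
            key=lambda j: conditional_gaussian_entropy(cov, selected + [j], class_idx),
        )
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy joint covariance over (x0, x1, y): y = x0 + noise, x1 is independent of y.
cov = np.array([
    [1.0, 0.0, 1.00],
    [0.0, 1.0, 0.00],
    [1.0, 0.0, 1.25],
])
print(greedy_select(cov, candidate_idx=[0, 1], class_idx=[2], n_select=1))
```

In this toy case, conditioning on x0 leaves a residual variance of 0.25 for y, while conditioning on x1 leaves the full 1.25, so the informative feature is picked first. With real expression data the covariance would have to be estimated from few samples, which typically calls for some form of regularization; that step is outside this sketch.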

Published In

IEEE/ACM Transactions on Computational Biology and Bioinformatics  Volume 7, Issue 1
January 2010
190 pages

Publisher

IEEE Computer Society Press

Washington, DC, United States

Author Tags

  1. Feature selection
  2. entropy
  3. multivariate Gaussian generative model
