Coclustering of Human Cancer Microarrays Using Minimum Sum-Squared Residue Coclustering

Published: 01 July 2008

Abstract

There is a consensus in microarray analysis that identifying potential local patterns, characterized by coherent groups of genes and conditions, may shed light on previously undetectable cellular processes of genes as well as on macroscopic phenotypes of related samples. To cluster genes and conditions simultaneously, we previously developed a fast co-clustering algorithm, Minimum Sum-Squared Residue Co-clustering (MSSRCC), which employs an alternating minimization scheme and generates what we call co-clusters in a checkerboard structure. In this paper, we propose specific strategies that enable MSSRCC to escape poor local minima and resolve the degeneracy problem in partitional clustering algorithms. The strategies are binormalization, deterministic spectral initialization, and incremental local search. We assess the effects of these strategies on both synthetic gene expression datasets and real human cancer microarrays and provide empirical evidence that MSSRCC with the proposed strategies performs better than existing co-clustering and clustering algorithms. In particular, the combination of all three strategies leads to the best performance. Furthermore, we illustrate the coherence of the resulting co-clusters in a checkerboard structure, where genes in a co-cluster manifest the phenotype structure of the corresponding samples, and evaluate the enrichment of functional annotations in Gene Ontology (GO).
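
To make the alternating minimization scheme concrete, the following is a minimal Python sketch and not the authors' implementation. It assumes the simplest residue, the difference between an entry and the mean of its checkerboard block, and it leaves out binormalization, spectral initialization, and local search entirely. The names `block_means`, `objective`, and `cocluster`, the toy matrix, and the starting labels are all hypothetical and chosen only for illustration.

```python
def block_means(A, rows, cols, k, l):
    """Mean of every (row cluster, column cluster) block; empty blocks get 0.0."""
    sums = [[0.0] * l for _ in range(k)]
    counts = [[0] * l for _ in range(k)]
    for i, row in enumerate(A):
        for j, a in enumerate(row):
            sums[rows[i]][cols[j]] += a
            counts[rows[i]][cols[j]] += 1
    return [[sums[r][c] / counts[r][c] if counts[r][c] else 0.0
             for c in range(l)] for r in range(k)]


def objective(A, rows, cols, mu):
    """Sum of squared residues a_ij - mu[rows[i]][cols[j]] (assumed residue)."""
    return sum((a - mu[rows[i]][cols[j]]) ** 2
               for i, row in enumerate(A) for j, a in enumerate(row))


def cocluster(A, rows, cols, k, l, max_iter=50):
    """Alternate row and column reassignment until the objective stops improving."""
    rows, cols = list(rows), list(cols)
    m, n = len(A), len(A[0])
    mu = block_means(A, rows, cols, k, l)
    best = current = objective(A, rows, cols, mu)
    for _ in range(max_iter):
        # Row step: move each row to the row cluster with the smallest residue,
        # holding the column labels and block means fixed.
        rows = [min(range(k),
                    key=lambda r: sum((A[i][j] - mu[r][cols[j]]) ** 2
                                      for j in range(n)))
                for i in range(m)]
        mu = block_means(A, rows, cols, k, l)
        # Column step: the same update with the roles of rows and columns swapped.
        cols = [min(range(l),
                    key=lambda c: sum((A[i][j] - mu[rows[i]][c]) ** 2
                                      for i in range(m)))
                for j in range(n)]
        mu = block_means(A, rows, cols, k, l)
        current = objective(A, rows, cols, mu)
        if best - current < 1e-12:  # no further decrease: a local minimum
            break
        best = current
    return rows, cols, current


# Toy 4 x 4 matrix with an exact 2 x 2 checkerboard structure.
A = [[1, 1, 5, 5],
     [1, 1, 5, 5],
     [9, 9, 3, 3],
     [9, 9, 3, 3]]
# Deliberately imperfect starting labels (hypothetical).
row_labels, col_labels, residue = cocluster(A, [0, 0, 1, 0], [0, 1, 1, 1], k=2, l=2)
print(row_labels, col_labels, residue)
```

Each reassignment step and each recomputation of the block means can only lower the objective or leave it unchanged, so the loop stops at a local minimum. How good that minimum is depends on the starting labels, which is the sensitivity the strategies proposed in the abstract are meant to address.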

Published In

IEEE/ACM Transactions on Computational Biology and Bioinformatics, Volume 5, Issue 3
July 2008
159 pages

Publisher

IEEE Computer Society Press

Washington, DC, United States

Author Tags

  1. Gene Ontology
  2. binormalization
  3. co-clustering
  4. deterministic spectral initialization
  5. local search
  6. microarray analysis

Qualifiers

  • Article
