article

Ortholog Clustering on a Multipartite Graph

Authors:

Akshay Vashist,

Casimir A. Kulikowski,

Ilya MuchnikAuthors Info & Claims

IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), Volume 4, Issue 1

Pages 17 - 27

https://doi.org/10.1109/TCBB.2007.1004

Published: 01 January 2007 Publication History

Abstract

We present a method for automatically extracting groups of orthologous genes from a large set of genomes by a new clustering algorithm on a weighted multipartite graph. The method assigns a score to an arbitrary subset of genes from multiple genomes to assess the orthologous relationships between genes in the subset. This score is computed using sequence similarities between the member genes and the phylogenetic relationship between the corresponding genomes. An ortholog cluster is found as the subset with the highest score, so ortholog clustering is formulated as a combinatorial optimization problem. The algorithm for finding an ortholog cluster runs in time O(|E|+|V| log|V|), where V and E are the sets of vertices and edges, respectively, in the graph. However, if we discretize the similarity scores into a constant number of bins, the runtime improves to O(|E|+|V|). The proposed method was applied to seven complete eukaryote genomes on which the manually curated database of eukaryotic ortholog clusters, KOG, is constructed. A comparison of our results with the manually curated ortholog clusters shows that our clusters are well correlated with the existing clusters.

References

[1]

{1} W.M. Fitch, "Distinguishing Homologous from Analogous Proteins," Systematic Zoology, vol. 19, pp. 99-113, 1970.

[2]

{2} E.V. Koonin, "An Apology for Orthologs--Or Brave New Memes," Genome Biology, vol. 2, no. 4, 2001.

[3]

{3} V. Solovyev and I. Shahmuradov, "PromH: Promoters Identification Using Identification Orthologous Genomic Sequences," Nucleic Acids Research, vol. 31, no. 13, pp. 3540-3545, 2003.

[4]

{4} B. Lemos, B.R. Bettencourt, C.D. Meiklejohn, and D.L. Hartl, "Evolution of Proteins and Gene Expression Levels Are Coupled in Drosophila and Are Independently Associated with mRNA Abundance, Protein Length, and Number of Protein-Protein Interactions," Molecular Biology Evolution, vol. 22, pp. 1345-1354, 2005.

[5]

{5} W. Fujibuchi, H. Ogata, H. Matsuda, and M. Kanehisa, "Automatic Detection of Conserved Gene Clusters in Multiple Genomes by Graph Comparison and p-Quasi Grouping," Nucleic Acids Research, vol. 28, no. 2, pp. 4096-4036, 2002.

[6]

{6} M. Kamvysselis, N. Patterson, B. Birren, B. Berger, and E. Lander, "Whole-Genome Comparative Annotation and Regulatory Motif Discovery in Multiple Yeast Species," Proc. Int'l Conf. Research in Computational Molecular Biology (RECOMB), pp. 157-166, 2003.

Digital Library

[7]

{7} M. Remm, C.E. Strom, and E.L. Sonnhammer, "Automatic Clustering of Orthologs and In-Paralogs from Pairwise Species Comparisons," J. Molecular Biology, vol. 314, pp. 1041-1052, 2001.

[8]

{8} R. Tatusov, N. Fedorova, J. Jackson, A. Jacobs, B. Kiryutin, E. Koonin, D. Krylov, R.M.R.S. Mekhedov, A. Nikolskaya, B. Rao, S. Smirnov, A. Sverdlov, S. Vasudevan, Y. Wolf, J. Yin, and D. Natale, "The cOG Database: An Updated Version Includes Eukaryotes," BioMed Central Bioinformatics, 2003.

[9]

{9} R.L. Tatusov, E.V. Koonin, and D.J. Lipman, "A Genomic Perspective on Protein Families," Science, vol. 278, pp. 631-637, 1997.

[10]

{10} C.M. Zmasek and S.R. Eddy, "RIO: Analyzing Proteomes by Automated Phylogenomics Using Resampled Inference of Orthologs," BMC Bioinformatics, vol. 3, no. 14, 2002.

[11]

{11} M.A. Huynen and P. Bork, "Measuring Genome Evolution," Proc. Nat'l Academy of Sciences USA, vol. 95, pp. 5849-5856, 1998.

[12]

{12} J. Tang and B. Moret, "Phylogenetic Reconstruction from Gene Rearrangement Data with Unequal Gene Content," Proc. Eighth Workshop Algorithms and Data Structures (WADS'03), pp. 37-46, 2003.

[13]

{13} M. Dawande, P. Keskinocak, J.M. Swaminathan, and S. Tayur, "On Bipartite and Multipartite Clique Problems," J. Algorithms, vol. 41, pp. 388-403, 2001.

Digital Library

[14]

{14} J. Pei, D. Jiang, and A. Zhang, "On Mining Cross Graph Quasi-Cliques," Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD'05), 2005.

Digital Library

[15]

{15} J. Abello, M.G. Resende, and S. Sudarsky, "Massive Quasi-Clique Detection," Proc. Latin Am. Theoretical Informatics Symp. (LATIN '02), 2002.

Digital Library

[16]

{16} D.W. Matula and L.L. Beck, "Smallest-Last Ordering and Clustering and Graph Coloring Algorithms," J. ACM, vol. 30, no. 3, pp. 417-427, 1983.

Digital Library

[17]

{17} B. Mirkin and I. Muchnik, "Induced Layered Clusters, Hereditary Mappings, and Convex Geometries," Applied Math. Letters, vol. 15, pp. 293-298, 2002.

[18]

{18} M.L. Fredman and R.E. Tarjan, "Fibonacci Heaps and Their Uses in Improved Network Optimization," J. ACM, ACM, pp. 596-615, 1987.

Digital Library

[19]

{19} T. Cormen, C. Leiserson, R. Rivest, and C. Stein, Introduction to Algorithms, second ed. The MIT Press, 2001.

Digital Library

[20]

{20} S. Altschul, T. Madden, A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. Lipman, "Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs," Nucleic Acids Research, vol. 25, no. 17, pp. 3389-3402, 1997.

[21]

{21} A. Bateman, E. Birney, L. Cerruti, R. Durbin, L. Etwiller, S.R. Eddy, S. Griffiths-Jones, K.L. Howe, M. Marshall, and E.L.L. Sonnhammer, "The Pfam Protein Families Database," Nucleic Acids Research, vol. 30, no. 1, pp. 276-280, 2002.

[22]

{22} A. Bateman, L. Coin, R. Durbin, R.D. Finn, V. Hollich, S. Griffiths-Jones, A. Khanna, M. Marshall, S. Moxon, E.L.L. Sonnhammer, D.J. Studholme, C. Yeats, and S.R. Eddy, "The Pfam Protein Families Database," Nucleic Acids Research, vol. 32, no. 1, pp. 138- 141, 2004.

[23]

{23} B. Boeckmann, A. Bairoch, R. Apweiler, M.-C. Blatter, A. Estreicher, E. Gasteiger, M.J. Martin, K. Michoud, C. O'Donovan, I. Phan, S. Pilbout, and M. Schneider, "The SWISS-PROT Protein Knowledgebase and Its Supplement TrEMBL in 2003," Nucleic Acids Research, vol. 31, no. 1, pp. 365-370, 2003.

[24]

{24} J. Wootton and S. Federhen, "Statistics of Local Complexity in Amino Acid Sequences and Sequence Databases," Computers and Chemistry, vol. 17, pp. 149-163, 1993.

[25]

{25} A. Lupas, M. Van Dyke, and J. Stock, "Predicting Coiled Coils from Protein Sequences," Science, vol. 252, pp. 1162-1164, 1991.

[26]

{26} B.S. Everitt, S. Landau, and M. Leese, Cluster Analysis, fourth ed. Oxford Univ. Press, 2001.

[27]

{27} W.M. Rand, "Objective Criterion for the Evaluation of Clustering Methods," J. Am. Statistics Assoc., vol. 66, pp. 846-850, 1971.

[28]

{28} R. Guigo, I. Muchnik, and T.F. Smith, "Reconstruction of Ancient Molecular Phylogeny," Molecular Phylogenetic Evolution, vol. 6, no. 2, pp. 189-213, 1996.

[29]

{29} S.B. Cannon and N.D. Young, "OrthoParaMap: Distinguishing Orthologs from Paralogs by Integrating Comparative Genome Data and Gene Phylogenies," BMC Bioinformatics, vol. 4, no. 35, 2003.

Cited By

Levin M(2009)Combinatorial optimization in system configuration designAutomation and Remote Control10.1134/S000511790903018770:3(519-561)Online publication date: 1-Mar-2009
https://dl.acm.org/doi/10.1134/S0005117909030187

Index Terms

Ortholog Clustering on a Multipartite Graph

Recommendations

Ortholog clustering on a multipartite graph
WABI'05: Proceedings of the 5th International conference on Algorithms in Bioinformatics

We present a method for automatically extracting groups of orthologous genes from a large set of genomes through the development of a new clustering method on a weighted multipartite graph. The method assigns a score to an arbitrary subset of genes from ...
Screening for ortholog clusters using multipartite graph clustering by quasi-concave set function optimization
RSFDGrC'05: Proceedings of the 10th international conference on Rough Sets, Fuzzy Sets, Data Mining, and Granular Computing - Volume Part II

Finding orthologous genes, similar genes in different genomes, is a fundamental problem in comparative genomics. We present a model for automatically extracting candidate ortholog clusters in a large set of genomes using a new clustering method for ...
Multipartite graph clustering for structured datasets and automating ortholog extraction

Comments

Information & Contributors

Information

Published In

cover image IEEE/ACM Transactions on Computational Biology and Bioinformatics

IEEE/ACM Transactions on Computational Biology and Bioinformatics Volume 4, Issue 1

January 2007

160 pages

ISSN:1545-5963

Issue’s Table of Contents

Publisher

IEEE Computer Society Press

Washington, DC, United States

Publication History

Published: 01 January 2007

Published in TCBB Volume 4, Issue 1

Author Tags

Qualifiers

Article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

1
Total Citations
View Citations
515
Total Downloads

Downloads (Last 12 months)4
Downloads (Last 6 weeks)0

Reflects downloads up to 22 Feb 2025

Other Metrics

View Author Metrics

Citations

Cited By

Levin M(2009)Combinatorial optimization in system configuration designAutomation and Remote Control10.1134/S000511790903018770:3(519-561)Online publication date: 1-Mar-2009
https://dl.acm.org/doi/10.1134/S0005117909030187

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Issue’s Table of Contents