skip to main content
article

Ortholog Clustering on a Multipartite Graph

Published: 01 January 2007 Publication History

Abstract

We present a method for automatically extracting groups of orthologous genes from a large set of genomes by a new clustering algorithm on a weighted multipartite graph. The method assigns a score to an arbitrary subset of genes from multiple genomes to assess the orthologous relationships between genes in the subset. This score is computed using sequence similarities between the member genes and the phylogenetic relationship between the corresponding genomes. An ortholog cluster is found as the subset with the highest score, so ortholog clustering is formulated as a combinatorial optimization problem. The algorithm for finding an ortholog cluster runs in time O(|E|+|V| log|V|), where V and E are the sets of vertices and edges, respectively, in the graph. However, if we discretize the similarity scores into a constant number of bins, the runtime improves to O(|E|+|V|). The proposed method was applied to seven complete eukaryote genomes on which the manually curated database of eukaryotic ortholog clusters, KOG, is constructed. A comparison of our results with the manually curated ortholog clusters shows that our clusters are well correlated with the existing clusters.

References

[1]
{1} W.M. Fitch, "Distinguishing Homologous from Analogous Proteins," Systematic Zoology, vol. 19, pp. 99-113, 1970.
[2]
{2} E.V. Koonin, "An Apology for Orthologs--Or Brave New Memes," Genome Biology, vol. 2, no. 4, 2001.
[3]
{3} V. Solovyev and I. Shahmuradov, "PromH: Promoters Identification Using Identification Orthologous Genomic Sequences," Nucleic Acids Research, vol. 31, no. 13, pp. 3540-3545, 2003.
[4]
{4} B. Lemos, B.R. Bettencourt, C.D. Meiklejohn, and D.L. Hartl, "Evolution of Proteins and Gene Expression Levels Are Coupled in Drosophila and Are Independently Associated with mRNA Abundance, Protein Length, and Number of Protein-Protein Interactions," Molecular Biology Evolution, vol. 22, pp. 1345-1354, 2005.
[5]
{5} W. Fujibuchi, H. Ogata, H. Matsuda, and M. Kanehisa, "Automatic Detection of Conserved Gene Clusters in Multiple Genomes by Graph Comparison and p-Quasi Grouping," Nucleic Acids Research, vol. 28, no. 2, pp. 4096-4036, 2002.
[6]
{6} M. Kamvysselis, N. Patterson, B. Birren, B. Berger, and E. Lander, "Whole-Genome Comparative Annotation and Regulatory Motif Discovery in Multiple Yeast Species," Proc. Int'l Conf. Research in Computational Molecular Biology (RECOMB), pp. 157-166, 2003.
[7]
{7} M. Remm, C.E. Strom, and E.L. Sonnhammer, "Automatic Clustering of Orthologs and In-Paralogs from Pairwise Species Comparisons," J. Molecular Biology, vol. 314, pp. 1041-1052, 2001.
[8]
{8} R. Tatusov, N. Fedorova, J. Jackson, A. Jacobs, B. Kiryutin, E. Koonin, D. Krylov, R.M.R.S. Mekhedov, A. Nikolskaya, B. Rao, S. Smirnov, A. Sverdlov, S. Vasudevan, Y. Wolf, J. Yin, and D. Natale, "The cOG Database: An Updated Version Includes Eukaryotes," BioMed Central Bioinformatics, 2003.
[9]
{9} R.L. Tatusov, E.V. Koonin, and D.J. Lipman, "A Genomic Perspective on Protein Families," Science, vol. 278, pp. 631-637, 1997.
[10]
{10} C.M. Zmasek and S.R. Eddy, "RIO: Analyzing Proteomes by Automated Phylogenomics Using Resampled Inference of Orthologs," BMC Bioinformatics, vol. 3, no. 14, 2002.
[11]
{11} M.A. Huynen and P. Bork, "Measuring Genome Evolution," Proc. Nat'l Academy of Sciences USA, vol. 95, pp. 5849-5856, 1998.
[12]
{12} J. Tang and B. Moret, "Phylogenetic Reconstruction from Gene Rearrangement Data with Unequal Gene Content," Proc. Eighth Workshop Algorithms and Data Structures (WADS'03), pp. 37-46, 2003.
[13]
{13} M. Dawande, P. Keskinocak, J.M. Swaminathan, and S. Tayur, "On Bipartite and Multipartite Clique Problems," J. Algorithms, vol. 41, pp. 388-403, 2001.
[14]
{14} J. Pei, D. Jiang, and A. Zhang, "On Mining Cross Graph Quasi-Cliques," Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD'05), 2005.
[15]
{15} J. Abello, M.G. Resende, and S. Sudarsky, "Massive Quasi-Clique Detection," Proc. Latin Am. Theoretical Informatics Symp. (LATIN '02), 2002.
[16]
{16} D.W. Matula and L.L. Beck, "Smallest-Last Ordering and Clustering and Graph Coloring Algorithms," J. ACM, vol. 30, no. 3, pp. 417-427, 1983.
[17]
{17} B. Mirkin and I. Muchnik, "Induced Layered Clusters, Hereditary Mappings, and Convex Geometries," Applied Math. Letters, vol. 15, pp. 293-298, 2002.
[18]
{18} M.L. Fredman and R.E. Tarjan, "Fibonacci Heaps and Their Uses in Improved Network Optimization," J. ACM, ACM, pp. 596-615, 1987.
[19]
{19} T. Cormen, C. Leiserson, R. Rivest, and C. Stein, Introduction to Algorithms, second ed. The MIT Press, 2001.
[20]
{20} S. Altschul, T. Madden, A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. Lipman, "Gapped BLAST and PSI-BLAST: A New Generation of Protein Database Search Programs," Nucleic Acids Research, vol. 25, no. 17, pp. 3389-3402, 1997.
[21]
{21} A. Bateman, E. Birney, L. Cerruti, R. Durbin, L. Etwiller, S.R. Eddy, S. Griffiths-Jones, K.L. Howe, M. Marshall, and E.L.L. Sonnhammer, "The Pfam Protein Families Database," Nucleic Acids Research, vol. 30, no. 1, pp. 276-280, 2002.
[22]
{22} A. Bateman, L. Coin, R. Durbin, R.D. Finn, V. Hollich, S. Griffiths-Jones, A. Khanna, M. Marshall, S. Moxon, E.L.L. Sonnhammer, D.J. Studholme, C. Yeats, and S.R. Eddy, "The Pfam Protein Families Database," Nucleic Acids Research, vol. 32, no. 1, pp. 138- 141, 2004.
[23]
{23} B. Boeckmann, A. Bairoch, R. Apweiler, M.-C. Blatter, A. Estreicher, E. Gasteiger, M.J. Martin, K. Michoud, C. O'Donovan, I. Phan, S. Pilbout, and M. Schneider, "The SWISS-PROT Protein Knowledgebase and Its Supplement TrEMBL in 2003," Nucleic Acids Research, vol. 31, no. 1, pp. 365-370, 2003.
[24]
{24} J. Wootton and S. Federhen, "Statistics of Local Complexity in Amino Acid Sequences and Sequence Databases," Computers and Chemistry, vol. 17, pp. 149-163, 1993.
[25]
{25} A. Lupas, M. Van Dyke, and J. Stock, "Predicting Coiled Coils from Protein Sequences," Science, vol. 252, pp. 1162-1164, 1991.
[26]
{26} B.S. Everitt, S. Landau, and M. Leese, Cluster Analysis, fourth ed. Oxford Univ. Press, 2001.
[27]
{27} W.M. Rand, "Objective Criterion for the Evaluation of Clustering Methods," J. Am. Statistics Assoc., vol. 66, pp. 846-850, 1971.
[28]
{28} R. Guigo, I. Muchnik, and T.F. Smith, "Reconstruction of Ancient Molecular Phylogeny," Molecular Phylogenetic Evolution, vol. 6, no. 2, pp. 189-213, 1996.
[29]
{29} S.B. Cannon and N.D. Young, "OrthoParaMap: Distinguishing Orthologs from Paralogs by Integrating Comparative Genome Data and Gene Phylogenies," BMC Bioinformatics, vol. 4, no. 35, 2003.

Cited By

View all

Recommendations

Comments

Information & Contributors

Information

Published In

cover image IEEE/ACM Transactions on Computational Biology and Bioinformatics
IEEE/ACM Transactions on Computational Biology and Bioinformatics  Volume 4, Issue 1
January 2007
160 pages

Publisher

IEEE Computer Society Press

Washington, DC, United States

Publication History

Published: 01 January 2007
Published in TCBB Volume 4, Issue 1

Author Tags

  1. Graph-theoretic methods
  2. biology
  3. clustering algorithms
  4. genetics.

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)4
  • Downloads (Last 6 weeks)0
Reflects downloads up to 20 Feb 2025

Other Metrics

Citations

Cited By

View all

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media