
Correcting Base-Assignment Errors in Repeat Regions of Shotgun Assembly

Published: 01 January 2007

Abstract

Accurate base-assignment in repeat regions of a whole genome shotgun assembly is an unsolved problem. Since reads in repeat regions cannot be easily attributed to a unique location in the genome, current assemblers may place these reads arbitrarily. As a result, the base-assignment error rate in repeats is likely to be much higher than that in the rest of the genome. We developed an iterative algorithm, EULER-AIR, that is able to correct base-assignment errors in finished genome sequences in public databases. The Wolbachia genome is among the best finished genomes. Using this genome project as an example, we demonstrated that EULER-AIR can 1) discover and correct base-assignment errors, 2) provide accurate read assignments, 3) utilize finishing reads for accurate base-assignment, and 4) provide guidance for designing finishing experiments. In the genome of Wolbachia, EULER-AIR found 16 positions with ambiguous base-assignment and two positions with erroneous bases. Besides Wolbachia, many other genome sequencing projects have significantly fewer finishing reads and, hence, are likely to contain more base-assignment errors in repeats. We demonstrate that EULER-AIR is a software tool that can be used to find and correct base-assignment errors in a genome assembly project.
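The abstract does not spell out how EULER-AIR works internally, so the following is only a toy sketch of the general idea it describes: iteratively re-assigning ambiguous reads to repeat copies and re-calling the bases of each copy. The author tags mention expectation maximization; this sketch uses a simpler hard-assignment variant and is not the published algorithm. All names (`refine`, `READS`, `DRAFT`) and the data are hypothetical, and every read is assumed to span the whole repeat with no gaps.

```python
from collections import Counter


def mismatches(read, consensus):
    """Count positions where a read disagrees with a copy's consensus."""
    return sum(1 for r, c in zip(read, consensus) if r != c)


def assign_reads(reads, consensi):
    """Assign each read to the repeat copy it matches best (ties go to the lowest index)."""
    return [
        min(range(len(consensi)), key=lambda k: mismatches(read, consensi[k]))
        for read in reads
    ]


def call_consensus(reads, assignment, previous):
    """Re-call each copy's bases by majority vote over the reads assigned to it."""
    updated = []
    for k, old in enumerate(previous):
        members = [r for r, a in zip(reads, assignment) if a == k]
        if not members:
            updated.append(old)  # no evidence for this copy: keep the old call
            continue
        updated.append(
            "".join(Counter(col).most_common(1)[0][0] for col in zip(*members))
        )
    return updated


def refine(reads, draft, max_iter=10):
    """Alternate read assignment and base calling until the assignment stops changing."""
    consensi, assignment = list(draft), None
    for _ in range(max_iter):
        new_assignment = assign_reads(reads, consensi)
        if new_assignment == assignment:
            break
        assignment = new_assignment
        consensi = call_consensus(reads, assignment, consensi)
    return consensi, assignment


# Hypothetical data: two repeat copies that differ at positions 3 and 7.
# The draft for copy 0 carries an erroneous last base (A instead of T),
# and the third read contains one sequencing error at position 4.
READS = ["ACGTACGT", "ACGTACGT", "ACGTTCGT", "ACGAACGA", "ACGAACGA", "ACGAACGA"]
DRAFT = ["ACGTACGA", "ACGAACGA"]

consensi, assignment = refine(READS, DRAFT)
print(consensi, assignment)
```

In this toy case the erroneous draft base in copy 0 is corrected because the reads that best match copy 0 agree on a different base there. Real repeats would call for alignment, quality values, and soft (probabilistic) read assignment.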


Published In

IEEE/ACM Transactions on Computational Biology and Bioinformatics, Volume 4, Issue 1
January 2007
160 pages

Publisher

IEEE Computer Society Press

Washington, DC, United States


Author Tags

  1. Fragment assembly
  2. Expectation maximization
  3. Finishing

Qualifiers

  • Article

