
Bayesian Basecalling for DNA Sequence Analysis Using Hidden Markov Models

Published: 01 July 2007

Abstract

It has been shown that electropherograms of DNA sequences can be modeled with hidden Markov models (HMMs). Basecalling, the procedure that determines the sequence of bases from a given electropherogram, can then be performed using the Viterbi algorithm. A training step is required prior to basecalling in order to estimate the HMM parameters. In this paper, we propose a Bayesian approach that employs the Markov chain Monte Carlo (MCMC) method to perform basecalling. Such an approach not only allows one to naturally encode prior biological knowledge into the basecalling algorithm, but also exploits both the training data and the basecalling data in estimating the HMM parameters, leading to more accurate estimates. Using the recently sequenced genome of the organism Legionella pneumophila, we show that the MCMC basecaller outperforms the state-of-the-art basecalling algorithm in terms of total errors while requiring much less training than other proposed statistical basecallers.
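To make the Viterbi step mentioned in the abstract concrete, the following is a minimal sketch of Viterbi decoding on a toy HMM. It is not the paper's model: real electropherograms are continuous four-channel traces, whereas here each observation is assumed to be already discretized to the channel with the strongest signal. The function name `viterbi` and all probabilities (uniform start and transition probabilities, 0.85 for a matching emission) are hypothetical values chosen only for illustration.

```python
import math


def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return the most probable hidden-state path for a non-empty observation sequence."""
    # best[t][s]: log-probability of the best path that ends in state s at time t
    best = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = []  # back[t-1][s]: predecessor of s on the best path at time t
    for o in obs[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][o]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state to recover the full path
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Toy parameters (assumptions, not estimates from any real sequencer data)
STATES = "ACGT"
LOG_START = {s: math.log(0.25) for s in STATES}
LOG_TRANS = {p: {s: math.log(0.25) for s in STATES} for p in STATES}
LOG_EMIT = {
    s: {o: math.log(0.85 if o == s else 0.05) for o in STATES} for s in STATES
}

# Observations: the dominant channel at each (hypothetical) peak position
called = "".join(viterbi("GATTACA", STATES, LOG_START, LOG_TRANS, LOG_EMIT))
print(called)
```

Because the toy transition probabilities are uniform, the decoded path simply follows the dominant channel at each position. In a trained model the transition and emission parameters would be estimated from data, which is the step the abstract's Bayesian MCMC approach is concerned with.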

