Abstract
It has been shown that electropherograms of DNA sequences can be modeled with hidden Markov models (HMMs). Basecalling, the procedure that determines the sequence of bases from a given electropherogram, can then be performed using the Viterbi algorithm. A training step is required prior to basecalling in order to estimate the HMM parameters. In this paper, we propose a Bayesian approach that employs the Markov chain Monte Carlo (MCMC) method to perform basecalling. This approach not only encodes prior biological knowledge naturally in the basecalling algorithm but also uses both the training data and the basecalling data to estimate the HMM parameters, leading to more accurate estimates. Using the recently sequenced genome of the organism Legionella pneumophila, we show that the MCMC basecaller outperforms the state-of-the-art basecalling algorithm in terms of total errors while requiring much less training than other proposed statistical basecallers.
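To make the Viterbi step mentioned above concrete, here is a minimal sketch of Viterbi decoding on a toy HMM. It is not the model from the paper: the discrete "dominant dye channel" observations, the uniform transition probabilities, the 0.85 emission accuracy, and the names `viterbi` and `call_bases` are all assumptions chosen for illustration, whereas a real basecaller works on continuous electropherogram signals with trained parameters.

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return the most probable hidden-state path for obs (log-space Viterbi)."""
    # best[t][s]: log-probability of the best path that ends in state s at time t
    best = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = []  # back[t-1][s]: predecessor of s on that best path
    for o in obs[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][o]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state to recover the full path
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Toy parameters (assumed, not taken from the paper)
BASES = "ACGT"
LOG_START = {b: math.log(0.25) for b in BASES}
LOG_TRANS = {a: {b: math.log(0.25) for b in BASES} for a in BASES}
# The observed symbol is the dominant dye channel: right with prob. 0.85
LOG_EMIT = {
    b: {o: math.log(0.85 if o == b else 0.05) for o in BASES} for b in BASES
}

def call_bases(peaks):
    """Decode a non-empty string of dominant-channel symbols into called bases."""
    return "".join(viterbi(list(peaks), BASES, LOG_START, LOG_TRANS, LOG_EMIT))

print(call_bases("GATTACA"))
```

With uniform transitions the decoder simply follows the most likely emission at each position. A trained transition matrix and a continuous emission model are what would let the decoder resolve ambiguous or overlapping peaks.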
Index Terms
- Bayesian Basecalling for DNA Sequence Analysis Using Hidden Markov Models