Bayesian Basecalling for DNA Sequence Analysis Using Hidden Markov Models
Source: IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Volume 4, Issue 3 (July 2007)
Pages 430-440
Year of Publication: 2007
ISSN:1545-5963
Authors: Kuo-ching Liang, Xiaodong Wang, Dimitris Anastassiou
Publisher
IEEE Computer Society Press  Los Alamitos, CA, USA
DOI Bookmark: 10.1109/tcbb.2007.1027

ABSTRACT

It has been shown that electropherograms of DNA sequences can be modeled with hidden Markov models (HMMs). Basecalling, the procedure that determines the sequence of bases from a given electropherogram, can then be performed using the Viterbi algorithm. A training step is required prior to basecalling in order to estimate the HMM parameters. In this paper, we propose a Bayesian approach that employs the Markov chain Monte Carlo (MCMC) method to perform basecalling. Such an approach not only allows one to naturally encode prior biological knowledge into the basecalling algorithm, but also exploits both the training data and the basecalling data in estimating the HMM parameters, leading to more accurate estimates. Using the recently sequenced genome of the organism Legionella pneumophila, we show that the MCMC basecaller outperforms the state-of-the-art basecalling algorithm in terms of total errors while requiring much less training than other proposed statistical basecallers.
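
For readers unfamiliar with the Viterbi step mentioned in the abstract, the following is a minimal sketch of Viterbi decoding on a toy discrete HMM. It is not the model from the paper: the four hidden states, the uniform start and transition probabilities, the 0.85/0.05 emission values, and the helper name toy_basecall are all illustrative assumptions, and each observation is simplified to the label of the strongest dye channel rather than a real electropherogram sample.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most probable hidden-state path for a discrete HMM."""
    if not observations:
        return []
    # best[t][s]: log-probability of the best path that ends in state s at time t
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = []  # back[t][s]: predecessor of s on that best path
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


def toy_basecall(peaks):
    """Decode a string of dominant-channel labels ('a', 'c', 'g', 't').

    All parameters below are made-up values for illustration only.
    """
    states = ["A", "C", "G", "T"]
    log_start = {s: math.log(0.25) for s in states}
    log_trans = {p: {s: math.log(0.25) for s in states} for p in states}
    log_emit = {
        s: {o: math.log(0.85 if o == s.lower() else 0.05) for o in "acgt"}
        for s in states
    }
    return "".join(viterbi(list(peaks), states, log_start, log_trans, log_emit))


if __name__ == "__main__":
    print(toy_basecall("gattaca"))
```

With hand-set parameters like these there is nothing to learn. In the setting the abstract describes, the HMM parameters are estimated from data, either in a separate training step or, in the proposed Bayesian approach, jointly from the training and basecalling data via MCMC.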



Collaborative Colleagues:
Kuo-ching Liang: colleagues
Xiaodong Wang: colleagues
Dimitris Anastassiou: colleagues