skip to main content
article

Learning Scoring Schemes for Sequence Alignment from Partial Examples

Published: 01 October 2008 Publication History

Abstract

When aligning biological sequences, the choice of parameter values for the alignment scoring function is critical. Small changes in gap penalties, for example, can yield radically different alignments. A rigorous way to compute parameter values that are appropriate for aligning biological sequences is through inverse parametric sequence alignment. Given a collection of examples of biologically correct alignments, this is the problem of finding parameter values that make the scores of the example alignments close to those of optimal alignments for their sequences. We extend prior work on inverse parametric alignment to partial examples, which contain regions where the alignment is left unspecified, and to an improved formulation based on minimizing the average error between the score of an example and the score of an optimal alignment. Experiments on benchmark biological alignments show we can find parameters that generalize across protein families and that boost the accuracy of multiple sequence alignment by as much as 25%.

References

[1]
A. Bahr, J. D. Thompson, J. C. Thierry, and O. Poch, "BAliBASE (Benchmark Alignment dataBASE): Enhancements for Repeats, Transmembrane Sequences and Circular Permutations," Nucleic Acids Research, vol. 29, no. 1, pp. 323-326, 2001.
[2]
S. Balaji, S. Sujatha, S. S. C. Kumar, and N. Srinivasan, "PALI: A Database of Alignments and Phylogeny of Homologous Protein Structures," Nucleic Acids Research, vol. 29, no. 1, pp. 61-65, 2001.
[3]
H. Carrillo and D. Lipman, "The Multiple Sequence Alignment Problem in Biology," SIAM J. Applied Math., vol. 48, pp. 1073-1082, 1988.
[4]
W. Cook, W. Cunningham, W. Pulleyblank, and A. Schrijver, Combinatorial Optimization. John Wiley & Sons, 1998.
[5]
M. O. Dayhoff, R. M. Schwartz, and B. C. Orcutt, "A Model of Evolutionary Change in Proteins," Atlas of Protein Sequence and Structure, M. O. Dayhoff, ed., vol. 5, no. 3, pp. 345-352, Nat'l Biomedical Research Foundation, 1978.
[6]
C. Do, CONTRAlign: CONditional TRAining for Protein Sequence Alignment, version 1.04, http://contra.stanford.edu/contralign/, 2005.
[7]
C. Do, S. Gross, and S. Batzoglou, "CONTRAlign: Discriminative Training for Protein Sequence Alignment," Proc. 10th Conf. Research in Computational Molecular Biology (RECOMB '06), pp. 160-174, 2006.
[8]
R. C. Edgar, "MUSCLE: Multiple Sequence Alignment with High Accuracy and High Throughput," Nucleic Acids Research, vol. 32, pp. 1792-1797, 2004.
[9]
D. Eppstein, "Setting Parameters by Example," SIAM J. Computing , vol. 32, no. 3, pp. 643-653, 2003.
[10]
V. S. Gowri, S. B. Pandit, B. Anand, N. Srinivasan, and S. Balaji, PALI: Phylogeny and ALIgnment of Homologous Protein Structures, release 2.3, http://pauling.mbu.iisc.ernet.in/~pali, 2005.
[11]
J. R. Griggs, P. Hanlon, A. M. Odlyzko, and M. S. Waterman, "On the Number of Alignments of k Sequences," Graphs and Combinatorics, vol. 6, pp. 133-146, 1990.
[12]
M. Grötschel, L. Lovász, and A. Schrijver, "The Ellipsoid Method and Its Consequences in Combinatorial Optimization," Combinatorica , vol. 1, pp. 169-197, 1981.
[13]
M. Grötschel, L. Lovász, and A. Schrijver, Geometric Algorithms and Combinatorial Optimization. Springer, 1988.
[14]
D. Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge Univ. Press, 1997.
[15]
D. Gusfield, K. Balasubramanian, and D. Naor, "Parametric Optimization of Sequence Alignment," Algorithmica, vol. 12, pp. 312-326, 1994.
[16]
D. Gusfield and P. Stelling, "Parametric and Inverse-Parametric Sequence Alignment with XPARAL," Methods in Enzymology, vol. 266, pp. 481-494, 1996.
[17]
S. Henikoff and J. G. Henikoff, "Amino Acid Substitution Matrices from Protein Blocks," Proc. Nat'l Academy of Sciences USA, vol. 89, pp. 10915-10919, 1992.
[18]
R. M. Karp and C. H. Papadimitriou, "On Linear Characterization of Combinatorial Optimization Problems," SIAM J. Computing, vol. 11, pp. 620-632, 1982.
[19]
J. Kececioglu and E. Kim, "Simple and Fast Inverse Alignment," Proc. 10th Conf. Research in Computational Molecular Biology (RECOMB '06), pp. 441-455, 2006.
[20]
J. Kececioglu and D. Starrett, "Aligning Alignments Exactly," Proc. Eighth ACM Conf. Research in Computational Molecular Biology (RECOMB '04), pp. 85-96, 2004.
[21]
E. Kim and J. Kececioglu, "Inverse Sequence Alignment from Partial Examples," Proc. Seventh EATCS/ISCB Workshop Algorithms in Bioinformatics (WABI '07), pp. 359-370, 2007.
[22]
Y. Lu and S.-H. Sze, "Multiple Sequence Alignment Based on Profile Alignment of Intermediate Sequences," Proc. 11th Conf. Research in Computational Molecular Biology (RECOMB '07), pp. 283-295, 2007.
[23]
A. Makhorin, GLPK: GNU Linear Programming Kit, release 4.8, http://www.gnu.org/software/glpk/, 2005.
[24]
K. Mizuguchi, C. M. Deane, T. L. Blundell, and J. P. Overington, "HOMSTRAD: A Database of Protein Structure Alignments for Homologous Families," Protein Science, vol. 7, pp. 2469-2471, 1998.
[25]
A. G. Murzin, S. E. Brenner, T. Hubbard, and C. Chothia, "SCOP: A Structural Classification of Proteins Database for the Investigation of Sequences and Structures," J. Molecular Biology, vol. 247, pp. 536-540, 1995.
[26]
A. Murzin, J.-M. Chandonia, A. Andreeva, D. Howorth, L. Lo Conte, B. Ailey, S. Brenner, T. Hubbard, and C. Chothia, SCOP: Structural Classification of Proteins, release 1.69, http:// scop.berkeley.edu, 2005.
[27]
L. Pachter and B. Sturmfels, "Parametric Inference for Biological Sequence Analysis," Proc. Nat'l Academy of Sciences USA, vol. 101, no. 46, pp. 16138-16143, 2004.
[28]
M. W. Padberg and M. R. Rao, "The Russian Method for Linear Programming III: Bounded Integer Programming," Technical Report 81-39, Graduate School of Business and Administration, New York Univ., 1981.
[29]
J. Pei and N. V. Grishin, "PROMALS: Towards Accurate Multiple Sequence Alignments of Distantly Related Proteins," Bioinformatics , vol. 23, no. 7, pp. 802-808, 2007.
[30]
D. Starrett, T. J. Wheeler, and J. D. Kececioglu, AlignAlign: Software for Optimally Aligning Alignments, version 0.9.7, http:// alignalign.cs.arizona.edu, 2005.
[31]
F. Sun, D. Fernández-Baca, and W. Yu, "Inverse Parametric Sequence Alignment," J. Algorithms, vol. 53, pp. 36-54, 2004.
[32]
J. D. Thompson, D. G. Higgins, and T. J. Gibson, "CLUSTAL W: Improving the Sensitivity of Progressive Multiple Sequence Alignment through Sequence Weighting, Position-Specific Gap Penalties and Weight Matrix Choice," Nucleic Acids Research, vol. 22, pp. 4673-4680, 1994.
[33]
I. Van Walle, I. Lasters, and L. Wyns, "SABmark: A Benchmark for Sequence Alignment That Covers the Entire Known Fold Space," Bioinformatics, vol. 21, no. 7, pp. 1267-1268, 2005.
[34]
M. Waterman, M. Eggert, and E. Lander, "Parametric Sequence Comparisons," Proc. Nat'l Academy of Sciences USA, vol. 89, pp. 6090-6093, 1992.
[35]
T. J. Wheeler and J. D. Kececioglu, "Multiple Alignment by Aligning Alignments," Proc. 15th ISCB Conf. Intelligent Systems for Molecular Biology (ISMB '07), Bioinformatics, vol. 23, pp. i559- i568, 2007.
[36]
T. J. Wheeler and J. D. Kececioglu, Opal: Software for Aligning Multiple Biological Sequences, version 0.3.7, http://opal.cs.arizona. edu, 2007.
[37]
C.-N. Yu, T. Joachims, R. Elber, and J. Pillardy, "Support Vector Training of Protein Alignment Models," Proc. 11th Conf. Research in Computational Molecular Biology (RECOMB '07), pp. 253-267, 2007.
[38]
H. Zhou and Y. Zhou, "SPEM: Improving Multiple Sequence Alignment with Sequence Profiles and Predicted Secondary Structures," Bioinformatics, vol. 21, no. 18, pp. 3615-3621, 2005.

Cited By

View all
  • (2017)Position-Residue Specific Dynamic Gap Penalty Scoring Strategy for Multiple Sequence AlignmentProceedings of the 8th International Conference on Computational Systems-Biology and Bioinformatics10.1145/3156346.3156354(42-45)Online publication date: 7-Dec-2017
  • (2015)Parameterized BLOSUM matrices for protein alignmentIEEE/ACM Transactions on Computational Biology and Bioinformatics10.1109/TCBB.2014.236612612:3(686-694)Online publication date: 1-May-2015
  • (2012)Estimating the accuracy of multiple alignments and its use in parameter advisingProceedings of the 16th Annual international conference on Research in Computational Molecular Biology10.1007/978-3-642-29627-7_5(45-59)Online publication date: 21-Apr-2012

Recommendations

Comments

Information & Contributors

Information

Published In

cover image IEEE/ACM Transactions on Computational Biology and Bioinformatics
IEEE/ACM Transactions on Computational Biology and Bioinformatics  Volume 5, Issue 4
October 2008
158 pages

Publisher

IEEE Computer Society Press

Washington, DC, United States

Publication History

Published: 01 October 2008
Published in TCBB Volume 5, Issue 4

Author Tags

  1. Analysis of Algorithms and Problem Complexity
  2. Biology and genetics
  3. Linear programming
  4. Pattern matching

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)3
  • Downloads (Last 6 weeks)0
Reflects downloads up to 01 Mar 2025

Other Metrics

Citations

Cited By

View all
  • (2017)Position-Residue Specific Dynamic Gap Penalty Scoring Strategy for Multiple Sequence AlignmentProceedings of the 8th International Conference on Computational Systems-Biology and Bioinformatics10.1145/3156346.3156354(42-45)Online publication date: 7-Dec-2017
  • (2015)Parameterized BLOSUM matrices for protein alignmentIEEE/ACM Transactions on Computational Biology and Bioinformatics10.1109/TCBB.2014.236612612:3(686-694)Online publication date: 1-May-2015
  • (2012)Estimating the accuracy of multiple alignments and its use in parameter advisingProceedings of the 16th Annual international conference on Research in Computational Molecular Biology10.1007/978-3-642-29627-7_5(45-59)Online publication date: 21-Apr-2012

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media