ABSTRACT
We propose a method for automated extraction of protein-protein interactions from scientific text. Our system matches sentences against syntax patterns that typically describe protein interactions. We define a set of 22 patterns, each a regular expression consisting of anchor positions and parameterizable constraints. This small set is then refined and optimized using a genetic algorithm on a training set. No heuristic definitions are necessary, and the final pattern set can be generated completely without manual curation. Our method can be applied to any syntax pattern-based protein-protein interaction system and thus complements related work on building comprehensive sets of such patterns. Applying different fitness functions during evolution provides an easy way to tune the system toward precision, recall, or F-measure. We evaluate our system on two samples, one derived from the BioCreAtIvE corpus, the other from references in the Database of Interacting Proteins (DIP). The automatic refinement of patterns adds up to 16% to the precision and 5% to the recall of our system. We additionally study the impact of proper protein name recognition, which could improve precision by about 17% and recall by 12%.
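The idea of tuning pattern constraints with a genetic algorithm under interchangeable fitness functions can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the single "bind" pattern, its two gap constraints, the toy training sentences (protein names already replaced by PROT), and the GA settings (truncation selection, one-point crossover, single-step mutation).

```python
import random
import re
from typing import Callable, List, Tuple

Params = Tuple[int, int]  # (max words before the verb, max words after it)

# Hypothetical toy training set: (sentence, True if it states an interaction).
TRAINING: List[Tuple[str, bool]] = [
    ("PROT binds PROT", True),
    ("PROT directly binds to PROT", True),
    ("PROT was shown in yeast to bind strongly to PROT", True),
    ("PROT binds DNA while unrelated work on membranes mentions PROT", False),
    ("PROT and PROT were both sequenced", False),
]
MAX_GAP = 8  # assumed upper bound for each constraint


def matches(params: Params, sentence: str) -> bool:
    """Apply one illustrative pattern: PROT, gap, verb 'bind*', gap, PROT."""
    before, after = params
    pattern = rf"PROT (?:\w+ ){{0,{before}}}bind\w* (?:\w+ ){{0,{after}}}PROT"
    return re.search(pattern, sentence) is not None


def counts(params: Params) -> Tuple[int, int, int]:
    """Return (true positives, false positives, false negatives) on TRAINING."""
    tp = fp = fn = 0
    for sentence, is_interaction in TRAINING:
        hit = matches(params, sentence)
        if hit and is_interaction:
            tp += 1
        elif hit:
            fp += 1
        elif is_interaction:
            fn += 1
    return tp, fp, fn


def precision(tp: int, fp: int, fn: int) -> float:
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp: int, fp: int, fn: int) -> float:
    return tp / (tp + fn) if tp + fn else 0.0


def f_measure(tp: int, fp: int, fn: int) -> float:
    p, r = precision(tp, fp, fn), recall(tp, fp, fn)
    return 2 * p * r / (p + r) if p + r else 0.0


def evolve(fitness: Callable[[int, int, int], float],
           generations: int = 30, size: int = 20, seed: int = 0) -> Params:
    """Evolve the two gap constraints toward the chosen fitness function."""
    rng = random.Random(seed)

    def score(ind: Params) -> float:
        return fitness(*counts(ind))

    population = [(rng.randint(0, MAX_GAP), rng.randint(0, MAX_GAP))
                  for _ in range(size)]
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        parents = population[: size // 2]  # truncation selection, keeps the elite
        children = []
        while len(parents) + len(children) < size:
            a, b = rng.sample(parents, 2)
            child = [a[0], b[1]]  # one-point crossover
            if rng.random() < 0.3:  # mutate one constraint by one step
                i = rng.randrange(2)
                child[i] = min(MAX_GAP, max(0, child[i] + rng.choice((-1, 1))))
            children.append((child[0], child[1]))
        population = parents + children
    return max(population, key=score)


if __name__ == "__main__":
    for fit in (precision, recall, f_measure):
        best = evolve(fit)
        print(fit.__name__, best, counts(best))
```

Passing precision, recall, or f_measure to evolve changes only the selection pressure; the pattern representation and the genetic operators stay the same, which is what makes tuning toward a particular measure cheap in a setup like this.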