Chemical Name Extraction Based on Automatic Training Data Generation and Rich Feature Set

Authors:
Su Yan (IBM, San Jose), W. Scott Spangler (IBM, San Jose), Ying Chen (IBM, San Jose)

IEEE/ACM Transactions on Computational Biology and Bioinformatics, Volume 10, Issue 5, pp 1218–1233, https://doi.org/10.1109/TCBB.2013.101

Published: 01 September 2013

Abstract

Automating the extraction of chemical names from text has significant value for biomedical and life science research. A major barrier to this task is the difficulty of obtaining a sizable, good-quality data set for training a reliable entity extraction model. Another difficulty is the selection of informative features of chemical names, since this normally requires comprehensive domain knowledge of chemical nomenclature. Leveraging random text generation techniques, we explore the idea of automatically creating training sets for the task of chemical name extraction. Assuming the availability of an incomplete list of chemical names, called a dictionary, we are able to generate well-controlled, random, yet realistic chemical-like training documents. We statistically analyze the construction of chemical names based on the incomplete dictionary and propose a series of new features without relying on any domain knowledge. Compared to state-of-the-art models learned from manually labeled data and domain knowledge, our solution shows better or comparable results in annotating real-world data with less human effort. Moreover, we report an interesting observation about the language of chemical names: both the structural and the semantic components of chemical names follow a Zipfian distribution, as do the words of many natural languages.
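The idea of generating labeled training documents from an incomplete dictionary can be illustrated with a minimal sketch. It is not the authors' generator: the dictionary entries, the sentence templates, the `{chem}` slot marker, the whitespace tokenization, and the BIO tag names are all assumptions made for illustration. The point is only that, because the program itself places each name, the labels come for free.

```python
import random

# Hypothetical stand-ins: a tiny incomplete "dictionary" of chemical names
# and a few sentence templates with {chem} slots (already tokenized by spaces).
DICTIONARY = ["2-methylpropan-1-ol", "acetic acid", "sodium chloride"]
TEMPLATES = [
    "The sample was dissolved in {chem} and stirred .",
    "Addition of {chem} to {chem} gave a white solid .",
]


def generate_example(rng):
    """Fill one random template with random dictionary names.

    Returns (tokens, tags), where tags use the BIO scheme:
    B-CHEM for the first token of a name, I-CHEM for later tokens, O otherwise.
    """
    tokens, tags = [], []
    for word in rng.choice(TEMPLATES).split():
        if word == "{chem}":
            name_tokens = rng.choice(DICTIONARY).split()
            tokens.extend(name_tokens)
            tags.extend(["B-CHEM"] + ["I-CHEM"] * (len(name_tokens) - 1))
        else:
            tokens.append(word)
            tags.append("O")
    return tokens, tags


def generate_corpus(n, seed=0):
    """Generate n labeled sentences; a fixed seed makes the corpus reproducible."""
    rng = random.Random(seed)
    return [generate_example(rng) for _ in range(n)]


if __name__ == "__main__":
    for tokens, tags in generate_corpus(3):
        print(list(zip(tokens, tags)))
```

Token and tag sequences of this shape are the usual input for a sequence labeler such as a conditional random field, which is one of the author tags listed for this article. A realistic generator would need far more varied contexts than two fixed templates.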

Index Terms

1. Applied computing
  1. Life and medical sciences
2. Computing methodologies
  1. Machine learning
    1. Machine learning algorithms
      1. Feature selection

Published in

IEEE/ACM Transactions on Computational Biology and Bioinformatics Volume 10, Issue 5
September 2013
225 pages
ISSN: 1545-5963
Publisher
IEEE Computer Society Press
Washington, DC, United States
Author Tags
Chemical name extraction
IUPAC names
conditional random fields
drug research
feature design
formal grammar
patent analysis
Qualifiers
- article