
Chemical Name Extraction Based on Automatic Training Data Generation and Rich Feature Set

Published: 01 September 2013

Abstract

Automating the extraction of chemical names from text has significant value for biomedical and life science research. A major barrier in this task is the difficulty of obtaining a sizable, good-quality data set to train a reliable entity extraction model. Another difficulty is the selection of informative features of chemical names, since comprehensive domain knowledge of chemical nomenclature is required. Leveraging random text generation techniques, we explore the idea of automatically creating training sets for the task of chemical name extraction. Assuming the availability of an incomplete list of chemical names, called a dictionary, we are able to generate well-controlled, random, yet realistic chemical-like training documents. We statistically analyze the construction of chemical names based on the incomplete dictionary and propose a series of new features, without relying on any domain knowledge. Compared to state-of-the-art models learned from manually labeled data and domain knowledge, our solution shows better or comparable results in annotating real-world data with less human effort. Moreover, we report an interesting observation about the language of chemical names: both the structural and semantic components of chemical names follow a Zipfian distribution, resembling the behavior of many natural languages.
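
As a rough illustration of the idea of generating labeled training text from a dictionary, the following minimal Python sketch inserts dictionary names into filler sentences and emits BIO tags for each token. It is not the authors' generation procedure, which the abstract does not detail; the dictionary entries, the sentence templates, the `B-CHEM`/`I-CHEM`/`O` tag names, and the function names are all hypothetical choices made for this example.

```python
import random

# Hypothetical stand-ins: a tiny "incomplete dictionary" and filler templates.
DICTIONARY = ["2-methylpropan-1-ol", "acetic acid", "sodium chloride"]
TEMPLATES = [
    "The sample was treated with {} at room temperature .",
    "We observed that {} dissolved slowly .",
]


def make_labeled_sentence(rng):
    """Return (tokens, tags); BIO tags mark the inserted dictionary name."""
    name_tokens = rng.choice(DICTIONARY).split()
    tokens, tags = [], []
    for word in rng.choice(TEMPLATES).split():
        if word == "{}":
            # Replace the slot with the chemical name and tag each of its tokens.
            tokens.extend(name_tokens)
            tags.extend(["B-CHEM"] + ["I-CHEM"] * (len(name_tokens) - 1))
        else:
            tokens.append(word)
            tags.append("O")
    return tokens, tags


def make_corpus(n, seed=0):
    """Generate n labeled sentences reproducibly from a fixed seed."""
    rng = random.Random(seed)
    return [make_labeled_sentence(rng) for _ in range(n)]
```

Because the labels come from the insertion step itself, no manual annotation is needed; the resulting token and tag sequences could in principle be fed to a sequence labeler such as a conditional random field.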

