Abstract
Automatically extracting chemical names from text is of significant value to biomedical and life science research. A major barrier to this task is the difficulty of obtaining a sizable, high-quality data set with which to train a reliable entity extraction model. Another difficulty is selecting informative features of chemical names, since this requires comprehensive domain knowledge of chemical nomenclature. Leveraging random text generation techniques, we explore the idea of automatically creating training sets for chemical name extraction. Assuming the availability of an incomplete list of chemical names, called a dictionary, we are able to generate well-controlled, random, yet realistic chemical-like training documents. We statistically analyze how chemical names are constructed, based on the incomplete dictionary, and propose a series of new features without relying on any domain knowledge. Compared to state-of-the-art models learned from manually labeled data and domain knowledge, our solution achieves comparable or better results in annotating real-world data, with less human effort. Moreover, we report an interesting observation about the language of chemical names: both the structural and the semantic components of chemical names follow a Zipfian distribution, as do the words of many natural languages.
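As a rough illustration of the dictionary-driven idea described in the abstract, the sketch below inserts names from a small dictionary into randomly chosen filler words and emits B/I/O labels for free, since the position of each inserted name is known. This is not the paper's generator: the function `generate_labeled_sentence`, the toy `DICTIONARY` and `FILLERS` lists, the B/I/O tag scheme, and all parameters are assumptions made for illustration only.

```python
import random

def generate_labeled_sentence(dictionary, fillers, rng, n_slots=2, max_gap=4):
    """Return (tokens, tags): filler words with dictionary names inserted, tagged B/I/O."""
    tokens, tags = [], []
    for _ in range(n_slots):
        # Random run of non-chemical filler words, all labeled O.
        for _ in range(rng.randint(1, max_gap)):
            tokens.append(rng.choice(fillers))
            tags.append("O")
        # Insert a known chemical name; its position yields the labels automatically.
        name_tokens = rng.choice(dictionary).split()
        tokens.extend(name_tokens)
        tags.extend(["B"] + ["I"] * (len(name_tokens) - 1))
    return tokens, tags

# Hypothetical toy inputs, not taken from the paper.
DICTIONARY = ["acetic acid", "2-methylpropan-1-ol", "sodium chloride"]
FILLERS = ["the", "mixture", "was", "treated", "with", "and", "then", "added", "to"]

rng = random.Random(0)  # fixed seed so the generated sample is reproducible
tokens, tags = generate_labeled_sentence(DICTIONARY, FILLERS, rng)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Token/tag pairs in this form could, in principle, be fed to a sequence labeler such as a conditional random field, but which features and generator the authors actually use is not stated in the abstract.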
Index Terms
- Chemical Name Extraction Based on Automatic Training Data Generation and Rich Feature Set