DOI: 10.1145/2939672.2939720
research-article

Text Mining in Clinical Domain: Dealing with Noise

Published: 13 August 2016

ABSTRACT

Text mining in the clinical domain is usually more difficult than in general domains (e.g. newswire reports and scientific literature) because of the high level of noise in both the corpus and the training data for machine learning (ML). A large number of unknown words, non-words and ungrammatical sentences make up the noise in the clinical corpus. Unknown words are usually complex medical vocabulary, misspellings, acronyms and abbreviations, whereas unknown non-words are generally clinical patterns such as scores and measures. This noise creates obstacles in the initial lexical processing step as well as in subsequent semantic analysis. Furthermore, the labelled data used to build ML models is very costly to obtain because it requires intensive clinical knowledge from the annotators. Even when created by experts, the training examples usually contain errors and inconsistencies due to variations in the annotators' attentiveness. The clinical domain also suffers from imbalanced data distributions. These kinds of noise are very common and can affect overall information extraction performance, but they have not been carefully investigated in most published health informatics systems. This paper introduces a general clinical data mining architecture that has the potential to address all of these challenges using an automatic proof-reading process, a trainable finite state pattern recogniser, iterative model development and active learning. The reportability classifier based on this architecture achieved 98.25% sensitivity and 96.14% specificity on an Australian cancer registry's held-out test set, and active learning saved up to 92% of the training data provided for supervised ML.
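For readers unfamiliar with the two evaluation measures quoted in the abstract, the following is a minimal sketch of how sensitivity and specificity are computed for a binary classifier. It is not the authors' evaluation code; the function name, the label encoding (1 = reportable, 0 = not reportable) and the toy labels are assumptions made purely for illustration.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels.

    Assumed encoding (hypothetical): 1 = reportable, 0 = not reportable.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives

    # Sensitivity: fraction of truly positive cases that were detected.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    # Specificity: fraction of truly negative cases that were rejected.
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy labels for illustration only; these are not data from the paper.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

Reporting both measures is the usual choice when classes are imbalanced, since overall accuracy can look high even when the minority class is mostly missed.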


Published in

KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2016
2176 pages
ISBN: 9781450342322
DOI: 10.1145/2939672

Copyright © 2016 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States


Acceptance Rates

KDD '16 Paper Acceptance Rate: 66 of 1,115 submissions, 6%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%
