research-article

Text Mining in Clinical Domain: Dealing with Noise

Authors:
Hoang Nguyen, Data61 - CSIRO, Sydney, Australia
Jon Patrick, University of Sydney, Sydney, Australia

KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2016, Pages 549–558, https://doi.org/10.1145/2939672.2939720

Published: 13 August 2016

ABSTRACT

Text mining in the clinical domain is usually more difficult than in general domains (e.g. newswire reports and scientific literature) because of the high level of noise in both the corpus and the training data for machine learning (ML). A large number of unknown words, non-words and ungrammatical sentences make up the noise in the clinical corpus. Unknown words are usually complex medical vocabulary, misspellings, acronyms and abbreviations, whereas unknown non-words are generally clinical patterns such as scores and measures. This noise creates obstacles in the initial lexical processing step as well as in subsequent semantic analysis. Furthermore, the labelled data used to build ML models is very costly to obtain because it requires intensive clinical knowledge from the annotators. Even when created by experts, the training examples usually contain errors and inconsistencies due to variations in the annotators' attentiveness. The clinical domain also suffers from inherently imbalanced data distributions. These kinds of noise are very common and can affect overall information extraction performance, but they have not been carefully investigated in most published health informatics systems. This paper introduces a general clinical data mining architecture that has the potential to address all of these challenges using an automatic proof-reading process, a trainable finite state pattern recogniser, iterative model development and active learning. The reportability classifier based on this architecture achieved 98.25% sensitivity and 96.14% specificity on an Australian cancer registry's held-out test set, and active learning saved up to 92% of the training data provided for supervised ML.
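
For readers less familiar with the two metrics reported above, the following minimal sketch shows how sensitivity and specificity are conventionally computed for a binary classifier. It is an illustration only, not the authors' evaluation code: the function name `sensitivity_specificity`, the label convention (1 = reportable, 0 = not reportable) and the toy labels are all assumptions introduced here, and the numbers it prints are unrelated to the results in the paper.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels.

    Assumed convention (hypothetical, not from the paper):
    1 = reportable (positive class), 0 = not reportable (negative class).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must appear in y_true")

    sensitivity = tp / (tp + fn)  # share of positives that were found
    specificity = tn / (tn + fp)  # share of negatives correctly rejected
    return sensitivity, specificity


# Toy labels invented for illustration; they are not data from the paper.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

Reporting both values matters when classes are imbalanced, as the abstract notes is typical of clinical data: overall accuracy can look high even when the rarer class is poorly recognised.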

Index Terms

Text Mining in Clinical Domain: Dealing with Noise
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
  2. Machine learning
    1. Learning paradigms
      1. Supervised learning
        Supervised learning by classification
    2. Learning settings
      1. Active learning settings
2. Information systems
  1. Information systems applications
    1. Decision support systems
      1. Data analytics

Published in
KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2016
2176 pages
ISBN:9781450342322
DOI:10.1145/2939672
General Chairs: Balaji Krishnapuram (IBM), Mohak Shah (Bosch)
Program Chairs: Alex Smola (Amazon), Charu Aggarwal (IBM), Dou Shen (Baidu), Rajeev Rastogi (Amazon)
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
active learning
clinical
named-entity recognition
natural languages processing
text classification
Qualifiers
- research-article
Acceptance Rates
KDD '16 Paper Acceptance Rate: 66 of 1,115 submissions, 6%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%