ABSTRACT
Text mining in the clinical domain is usually more difficult than in general domains (e.g. newswire reports and scientific literature) because of the high level of noise in both the corpus and the training data for machine learning (ML). The noise in a clinical corpus consists of a large number of unknown words, non-words and ungrammatical sentences. Unknown words are usually complex medical vocabulary, misspellings, acronyms and abbreviations, whereas unknown non-words are generally clinical patterns such as scores and measures. This noise creates obstacles in the initial lexical processing step as well as in subsequent semantic analysis. Furthermore, the labelled data used to build ML models is very costly to obtain because annotation requires intensive clinical knowledge. Even when created by experts, the training examples usually contain errors and inconsistencies due to variations in the annotators' attentiveness. The clinical domain also suffers from imbalanced data distributions. These kinds of noise are very common and can affect overall information extraction performance, but they have not been carefully investigated in most published health informatics systems. This paper introduces a general clinical data mining architecture that has the potential to address all of these challenges using an automatic proof-reading process, a trainable finite state pattern recogniser, iterative model development and active learning. The reportability classifier based on this architecture achieved 98.25% sensitivity and 96.14% specificity on an Australian cancer registry's held-out test set, and active learning saved up to 92% of the training data provided for supervised ML.
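Because the abstract reports sensitivity and specificity rather than plain accuracy, a brief illustration of how the two are computed from a held-out test set may help; with imbalanced classes they are more informative than accuracy alone. The sketch below is not the authors' evaluation code. The function name `sensitivity_specificity`, the label convention (1 = reportable, 0 = not reportable) and the toy labels are all assumptions made for illustration.

```python
import math


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels.

    Assumed convention (not from the paper): 1 = reportable, 0 = not reportable.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms

    # Sensitivity: share of truly positive reports that were detected.
    sensitivity = tp / (tp + fn) if (tp + fn) else math.nan
    # Specificity: share of truly negative reports that were rejected.
    specificity = tn / (tn + fp) if (tn + fp) else math.nan
    return sensitivity, specificity


# Toy, made-up labels for ten reports (purely illustrative).
gold = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(gold, pred)
print(f"sensitivity={sens:.4f} specificity={spec:.4f}")
```

In the toy data one positive report is missed and one negative report is flagged, so both measures fall below 1 even though 8 of the 10 predictions are correct.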
Index Terms
- Text Mining in Clinical Domain: Dealing with Noise