Abstract
Electronic health records (EHR) provide a rich source of temporal data that present a unique opportunity to characterize disease patterns and the risk of imminent disease. While many data-mining tools have been adopted for EHR-based early detection of disease, linear discriminant analysis (LDA) is one of the most commonly used statistical methods. However, it is difficult to train an accurate LDA model for early disease diagnosis when too few patients are known to have the target disease. Furthermore, EHR data are heterogeneous and contain significant noise. In such cases, the covariance matrices used in LDA are usually singular and are estimated with a large variance.
This article presents Daehr, an extension of the LDA framework for electronic health record data that addresses these issues. Compared with existing LDA analyzers, Daehr is designed to (1) eliminate the data noise caused by the manual encoding of EHR data and (2) lower the variance of parameter (covariance matrix) estimation for LDA models when only a few patients’ EHRs are available for training. To achieve these two goals, we designed an iterative algorithm that improves covariance matrix estimation with embedded data-noise and parameter-variance reduction for LDA. We evaluated Daehr extensively using the College Health Surveillance Network, a large, real-world EHR dataset. Specifically, our experiments compared the performance of Daehr to three baselines (i.e., LDA and its derivatives) in identifying college students at high risk for mental health disorders from 23 U.S. universities. Experimental results demonstrate that Daehr significantly outperforms the three baselines, achieving 1.4% to 19.4% higher accuracy and a 7.5% to 43.5% higher F1-score.
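The small-sample problem described in the abstract can be illustrated with a minimal sketch. The code below is not Daehr's algorithm, which the abstract does not specify; it assumes synthetic Gaussian data and uses a generic shrinkage-toward-identity estimator (the names `pooled_covariance`, `shrink`, and the weight `alpha` are hypothetical) only to show why the pooled covariance matrix in LDA is singular when there are fewer patients than features, and how a regularized estimate restores invertibility.

```python
import numpy as np


def pooled_covariance(X, y):
    """Pooled within-class sample covariance, as used by classical LDA."""
    classes = np.unique(y)
    p = X.shape[1]
    S = np.zeros((p, p))
    for c in classes:
        Xc = X[y == c]
        centered = Xc - Xc.mean(axis=0)
        S += centered.T @ centered
    return S / (len(X) - len(classes))


def shrink(S, alpha):
    """Blend S with a scaled identity matrix.

    Generic shrinkage estimator chosen for illustration only;
    it is an assumption, not the estimator proposed in the article.
    """
    p = S.shape[0]
    target = (np.trace(S) / p) * np.eye(p)
    return (1.0 - alpha) * S + alpha * target


# Synthetic stand-in for EHR features: 10 patients, 20 features.
rng = np.random.default_rng(0)
n_per_class, p = 5, 20
X = np.vstack([
    rng.normal(0.0, 1.0, (n_per_class, p)),
    rng.normal(0.5, 1.0, (n_per_class, p)),
])
y = np.array([0] * n_per_class + [1] * n_per_class)

S = pooled_covariance(X, y)
S_shrunk = shrink(S, alpha=0.2)

# With n = 10 samples and 2 classes, rank(S) is at most n - 2 = 8 < p,
# so S cannot be inverted; the shrunk estimate is positive definite.
rank_S = np.linalg.matrix_rank(S)
rank_shrunk = np.linalg.matrix_rank(S_shrunk)
min_eig_shrunk = np.linalg.eigvalsh(S_shrunk).min()
```

Here `rank_S` falls below the feature count, while `S_shrunk` has full rank and can be inverted in the LDA discriminant. The choice of `alpha` trades bias against variance and would need to be tuned on real data.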
Index Terms
- Daehr: A Discriminant Analysis Framework for Electronic Health Record Data and an Application to Early Detection of Mental Health Disorders