Daehr: A Discriminant Analysis Framework for Electronic Health Record Data and an Application to Early Detection of Mental Health Disorders

Authors:
Haoyi Xiong, Missouri University of Science and Technology, Missouri, USA (ORCID: 0000-0002-5451-3253)
Jinghe Zhang, University of Virginia, VA, USA
Yu Huang, University of Virginia, VA, USA
Kevin Leach, University of Virginia, VA, USA
Laura E. Barnes, University of Virginia, VA, USA

ACM Transactions on Intelligent Systems and Technology, Volume 8, Issue 3, Article No. 47, pp. 1–21. https://doi.org/10.1145/3007195

Published: 08 February 2017

Abstract

Electronic health records (EHR) provide a rich source of temporal data that present a unique opportunity to characterize disease patterns and risk of imminent disease. While many data-mining tools have been adopted for EHR-based early disease detection, linear discriminant analysis (LDA) is one of the most commonly used statistical methods. However, it is difficult to train an accurate LDA model for early disease diagnosis when too few patients are known to have the target disease. Furthermore, EHR data are heterogeneous and contain significant noise. In such cases, the covariance matrices used in LDA are usually singular, and their estimates have large variance.
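The singularity problem described above is easy to reproduce. The following minimal sketch uses synthetic Gaussian data, not EHR data, to show that the sample covariance matrix cannot have full rank when there are fewer patients than features. The function name and the sizes are illustrative assumptions, not taken from the article.

```python
import numpy as np

def sample_covariance_rank(n_patients, n_features, seed=0):
    """Return the sample covariance of synthetic data and its numerical rank."""
    rng = np.random.default_rng(seed)
    # Synthetic stand-in for a patient-by-feature matrix (not real EHR data).
    X = rng.normal(size=(n_patients, n_features))
    S = np.cov(X, rowvar=False)  # shape: n_features x n_features
    return S, int(np.linalg.matrix_rank(S))

# 10 patients, 50 features: the rank is bounded by n_patients - 1.
S, rank = sample_covariance_rank(n_patients=10, n_features=50)
print(S.shape, rank)
```

Because the rank is below the dimension, the matrix has no inverse, so the standard LDA discriminant cannot be evaluated without some form of regularization.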

This article presents Daehr, an extension of the LDA framework using electronic health record data to address these issues. Beyond existing LDA analyzers, we propose Daehr to (1) eliminate the data noise caused by the manual encoding of EHR data and (2) lower the variance of parameter (covariance matrix) estimation for LDA models when only a few patients' EHRs are available for training. To achieve these two goals, we designed an iterative algorithm to improve the covariance matrix estimation with embedded data-noise/parameter-variance reduction for LDA. We evaluated Daehr extensively using the College Health Surveillance Network, a large, real-world EHR dataset. Specifically, our experiments compared the performance of Daehr to three baselines (i.e., LDA and its derivatives) in identifying college students at high risk for mental health disorders from 23 U.S. universities. Experimental results demonstrate that Daehr significantly outperforms the three baselines, achieving 1.4%–19.4% higher accuracy and a 7.5%–43.5% higher F1-score.
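To make the idea of variance-reduced covariance estimation concrete, here is a minimal sketch of a two-class LDA rule whose pooled covariance is shrunk toward its own diagonal. This is a generic regularization technique chosen for illustration. It is not the iterative algorithm proposed in the article, whose details the abstract does not give. The function names, the `alpha` parameter, and the synthetic data are all assumptions.

```python
import numpy as np

def pooled_covariance(X0, X1):
    """Pooled within-class sample covariance of two classes."""
    Xc = np.vstack([X0 - X0.mean(axis=0), X1 - X1.mean(axis=0)])
    return Xc.T @ Xc / (len(Xc) - 2)

def shrink_to_diagonal(S, alpha):
    """Convex blend of S with its diagonal; alpha in (0, 1] is a hypothetical knob."""
    return (1.0 - alpha) * S + alpha * np.diag(np.diag(S))

def fit_shrinkage_lda(X0, X1, alpha=0.5):
    """Fit a linear rule (x @ w + b > 0 means class 1) with a shrunk covariance."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    sigma = shrink_to_diagonal(pooled_covariance(X0, X1), alpha)
    w = np.linalg.solve(sigma, mu1 - mu0)
    # Midpoint threshold plus the log prior ratio estimated from class sizes.
    b = -0.5 * w @ (mu0 + mu1) + np.log(len(X1) / len(X0))
    return w, b

def predict(X, w, b):
    return (X @ w + b > 0).astype(int)

# Synthetic data: 36 training samples, 50 features, only 6 positive cases.
rng = np.random.default_rng(0)
p = 50
X0 = rng.normal(0.0, 1.0, size=(30, p))
X1 = rng.normal(1.5, 1.0, size=(6, p))
w, b = fit_shrinkage_lda(X0, X1)

X_test = np.vstack([rng.normal(0.0, 1.0, size=(100, p)),
                    rng.normal(1.5, 1.0, size=(100, p))])
y_test = np.repeat([0, 1], 100)
accuracy = float(np.mean(predict(X_test, w, b) == y_test))
print(f"held-out accuracy: {accuracy:.3f}")
```

Here the unshrunk pooled covariance is singular because there are fewer samples than features. Blending in the diagonal makes it positive definite, so the linear system can be solved.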

Supplemental Material

xiong.zip (93.2 KB): supplemental movie, appendix, image and software files for this article.

Index Terms

1. Applied computing
  1. Life and medical sciences

Published in
ACM Transactions on Intelligent Systems and Technology, Volume 8, Issue 3
Special Issue: Mobile Social Multimedia Analytics in the Big Data Era and Regular Papers
May 2017, 320 pages
ISSN: 2157-6904
EISSN: 2157-6912
DOI: 10.1145/3040485
Editor: Yu Zheng, Microsoft Research, China
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 8 February 2017
- Accepted: 1 October 2016
- Revised: 1 July 2016
- Received: 1 December 2015
Author Tags
Predictive models
anxiety/depression
early detection
electronic health data
temporal order
Qualifiers
- research-article
- Research
- Refereed