research-article

Daehr: A Discriminant Analysis Framework for Electronic Health Record Data and an Application to Early Detection of Mental Health Disorders

Published: 08 February 2017

Abstract

Electronic health records (EHR) provide a rich source of temporal data that present a unique opportunity to characterize disease patterns and risk of imminent disease. While many data-mining tools have been adopted for EHR-based early disease detection, linear discriminant analysis (LDA) is one of the most commonly used statistical methods. However, it is difficult to train an accurate LDA model for early disease diagnosis when too few patients are known to have the target disease. Furthermore, EHR data are heterogeneous and contain significant noise. In such cases, the covariance matrices used in LDA are usually singular and estimated with a large variance.

This article presents Daehr, an extension of the LDA framework using electronic health record data to address these issues. Beyond existing LDA analyzers, we propose Daehr to (1) eliminate the data noise caused by the manual encoding of EHR data and (2) lower the variance of parameter (covariance matrix) estimation for LDA models when only a few patients’ EHRs are available for training. To achieve these two goals, we designed an iterative algorithm to improve the covariance matrix estimation with embedded data-noise/parameter-variance reduction for LDA. We evaluated Daehr extensively using the College Health Surveillance Network, a large, real-world EHR dataset. Specifically, our experiments compared the performance of Daehr to three baselines (i.e., LDA and its derivatives) in identifying college students at high risk for mental health disorders from 23 U.S. universities. Experimental results demonstrate that Daehr significantly outperforms the three baselines by achieving 1.4%–19.4% higher accuracy and a 7.5%–43.5% higher F1-score.
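The small-sample problem mentioned in the abstract can be illustrated with a minimal sketch. It assumes synthetic Gaussian data and uses generic shrinkage toward a scaled identity as a stand-in remedy; this is not Daehr's iterative algorithm, which the abstract does not specify. The function names, the sample sizes, and the weight `alpha` are all hypothetical choices made for illustration.

```python
import numpy as np

def sample_covariance(X):
    """Unbiased sample covariance of the rows of X (n samples x p features)."""
    Xc = X - X.mean(axis=0)
    return Xc.T @ Xc / (X.shape[0] - 1)

def shrink_to_identity(S, alpha):
    """Blend S with a scaled identity matrix.

    Generic shrinkage, used here only as an illustrative remedy;
    it is NOT the estimator proposed in the article.
    """
    p = S.shape[0]
    target = (np.trace(S) / p) * np.eye(p)
    return (1.0 - alpha) * S + alpha * target

# Hypothetical setting: far fewer patients than features.
rng = np.random.default_rng(0)
n_patients, n_features = 10, 50
X = rng.normal(size=(n_patients, n_features))

S = sample_covariance(X)
rank_S = np.linalg.matrix_rank(S)  # at most n_patients - 1, so S is singular

S_shrunk = shrink_to_identity(S, alpha=0.2)  # alpha is an arbitrary choice
min_eig_shrunk = np.linalg.eigvalsh(S_shrunk).min()

print("rank of sample covariance:", rank_S, "of", n_features)
print("shrunk estimate is positive definite:", bool(min_eig_shrunk > 0))
```

Because the centered data matrix has at most `n_patients - 1` independent rows, the sample covariance cannot be inverted as LDA requires, while the shrunk estimate can. Choosing the shrinkage weight well is itself an estimation problem, which is part of why more careful covariance estimators are studied.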

Published in

ACM Transactions on Intelligent Systems and Technology, Volume 8, Issue 3
Special Issue: Mobile Social Multimedia Analytics in the Big Data Era and Regular Papers
May 2017
320 pages
ISSN: 2157-6904
EISSN: 2157-6912
DOI: 10.1145/3040485
Editor: Yu Zheng

Copyright © 2017 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery
New York, NY, United States

Publication History

• Published: 8 February 2017
• Accepted: 1 October 2016
• Revised: 1 July 2016
• Received: 1 December 2015

Qualifiers

• research-article
• Research
• Refereed
