skip to main content
10.1145/3219819.3219999acmotherconferencesArticle/Chapter ViewAbstractPublication PageskddConference Proceedingsconference-collections
research-article
Public Access

SUSTain: Scalable Unsupervised Scoring for Tensors and its Application to Phenotyping

Published:19 July 2018Publication History

ABSTRACT

This paper presents a new method, which we call SUSTain, that extends real-valued matrix and tensor factorizations to data where values are integers. Such data are common when the values correspond to event counts or ordinal measures. The conventional approach is to treat integer data as real, and then apply real-valued factorizations. However, doing so fails to preserve important characteristics of the original data, thereby making it hard to interpret the results. Instead, our approach extracts factor values from integer datasets as scores that are constrained to take values from a small integer set. These scores are easy to interpret: a score of zero indicates no feature contribution and higher scores indicate distinct levels of feature importance. At its core, SUSTain relies on: a) a problem partitioning into integer-constrained subproblems, so that they can be optimally solved in an efficient manner; and b) organizing the order of the subproblems' solution, to promote reuse of shared intermediate results. We propose two variants, SUSTain_M and SUSTain_T, to handle both matrix and tensor inputs, respectively. We evaluate SUSTain against several state-of-the-art baselines on both synthetic and real Electronic Health Record (EHR) datasets. Comparing to those baselines, SUSTain shows either significantly better fit or orders of magnitude speedups that achieve a comparable fit (up to 425× faster). We apply SUSTain to EHR datasets to extract patient phenotypes (i.e., clinically meaningful patient clusters). Furthermore, 87% of them were validated as clinically meaningful phenotypes related to heart failure by a cardiologist.

Skip Supplemental Material Section

Supplemental Material

perros_sustain_unsupervised.mp4

mp4

353.4 MB

References

  1. 2017. Clinical Classifications Software (CCS) for ICD-9-CM. https://www.hcup-us. ahrq.gov/toolssoftware/ccs/ccs.jsp. (2017). Accessed: 2017-02--11.Google ScholarGoogle Scholar
  2. Evrim Acar, Canan Aykut-Bingol, Haluk Bingol, Rasmus Bro, and Bülent Yener. 2007. Multiway analysis of epilepsy tensors. Bioinformatics 23, 13 (2007), i10--i18. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Brett W Bader and Tamara G Kolda. 2007. Efficient MATLAB computations with sparse and factored tensors. SIAM Journal on Scientific Computing 30, 1 (2007), 205--231. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Brett W. Bader, Tamara G. Kolda, and others. 2015. MATLAB Tensor Toolbox Version 2.6. Available online. (February 2015). http://www.sandia.gov/~tgkolda/ TensorToolbox/Google ScholarGoogle Scholar
  5. Stephen Boyd and Lieven Vandenberghe. 2004. Convex optimization. Cambridge university press. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Stephen Breen and Xiao-Wen Chang. 2012. Column Reordering for Box- Constrained Integer Least Squares Problems. (April 2012).Google ScholarGoogle Scholar
  7. John Brewer. 1978. Kronecker products and matrix calculus in system theory. IEEE Transactions on circuits and systems 25, 9 (1978), 772--781.Google ScholarGoogle ScholarCross RefCross Ref
  8. Rasmus Bro and Nicholaos D Sidiropoulos. 1998. Least squares algorithms under unimodality and non-negativity constraints. Journal of Chemometrics 12, 4 (1998), 223--247.Google ScholarGoogle ScholarCross RefCross Ref
  9. J Douglas Carroll and Jih-Jie Chang. 1970. Analysis of individual differences in multidimensional scaling via an N-way generalization of "Eckart-Young" decomposition. Psychometrika 35, 3 (1970), 283--319.Google ScholarGoogle ScholarCross RefCross Ref
  10. Xiao-Wen Chang and Tianyang Zhou. 2007. MILES: MATLAB package for solving mixed integer least squares problems. GPS Solutions 11, 4 (2007), 289--294. Last updated: June 2016.Google ScholarGoogle ScholarCross RefCross Ref
  11. Edward Choi, Andy Schuetz, Walter F Stewart, and Jimeng Sun. 2016. Using recurrent neural network models for early detection of heart failure onset. Journal of the American Medical Informatics Association 24, 2 (2016), 361--370.Google ScholarGoogle ScholarCross RefCross Ref
  12. Andrzej Cichocki and Anh-Huy Phan. 2009. Fast local algorithms for large scale nonnegative matrix and tensor factorizations. IEICE transactions on fundamentals of electronics, communications and computer sciences 92, 3 (2009), 708--721.Google ScholarGoogle Scholar
  13. Joshua C Denny, Marylyn D Ritchie, Melissa A Basford, Jill M Pulley, Lisa Bastarache, Kristin Brown-Gentry, Deede Wang, Dan R Masys, Dan M Roden, and Dana C Crawford. 2010. PheWAS: demonstrating the feasibility of a phenomewide scan to discover gene--disease associations. Bioinformatics 26, 9 (2010), 1205--1210. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Chris Ding, Tao Li, Wei Peng, and Haesun Park. 2006. Orthogonal nonnegative matrix t-factorizations for clustering. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 126--135. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Bo Dong, Matthew M Lin, and Haesun Park. 2017. Integer Matrix Approximation and Data Mining. Journal of scientific computing (Sept. 2017), 1--27. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Nicolas Gillis and others. 2011. Nonnegative matrix factorization: Complexity, algorithms and applications. Unpublished doctoral dissertation, Université catholique de Louvain. Louvain-La-Neuve: CORE (2011).Google ScholarGoogle Scholar
  17. Gene H Golub and Charles F Van Loan. 2013. Matrix Computations. Vol. 3. JHU Press.Google ScholarGoogle Scholar
  18. Richard A Harshman. 1970. Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multi-modal factor analysis. (1970).Google ScholarGoogle Scholar
  19. J Henderson, J C Ho, A N Kho, J C Denny, B A Malin, J Sun, and J Ghosh. 2017. Granite: Diversified, Sparse Tensor Factorization for Electronic Health Record-Based Phenotyping. In 2017 IEEE International Conference on Healthcare Informatics (ICHI). 214--223.Google ScholarGoogle Scholar
  20. Frank L Hitchcock. 1927. The expression of a tensor or a polyadic as a sum of products. Studies in Applied Mathematics 6, 1--4 (1927), 164--189.Google ScholarGoogle Scholar
  21. Joyce C Ho, Joydeep Ghosh, Steve R Steinhubl,Walter F Stewart, Joshua C Denny, Bradley A Malin, and Jimeng Sun. 2014. Limestone: high-throughput candidate phenotype generation via tensor factorization. Journal of biomedical informatics 52 (Dec. 2014), 199--211. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Joyce C Ho, Joydeep Ghosh, and Jimeng Sun. 2014. Marble: High-throughput Phenotyping from Electronic Health Records via Sparse Nonnegative Tensor Factorization. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '14). ACM, New York, NY, USA, 115--124. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Ngoc-Diep Ho. 2008. Nonnegative matrix factorization algorithms and applications. Ph.D. Dissertation. PhD thesis, Université catholique de Louvain.Google ScholarGoogle Scholar
  24. Shalmali Joshi, Suriya Gunasekar, David Sontag, and Ghosh Joydeep. 2016. Identifiable Phenotyping using Constrained Non-Negative Matrix Factorization. In Machine Learning for Healthcare Conference. 17--41.Google ScholarGoogle Scholar
  25. R. Kannan, G. Ballard, and H. Park. 2018. MPI-FAUN: An MPI-Based Framework for Alternating-Updating Nonnegative Matrix Factorization. IEEE Transactions on Knowledge and Data Engineering 30, 3 (March 2018), 544--558.Google ScholarGoogle ScholarCross RefCross Ref
  26. Alexandros Karatzoglou, Xavier Amatriain, Linas Baltrunas, and Nuria Oliver. 2010. Multiverse recommendation: n-dimensional tensor factorization for contextaware collaborative filtering. In Proceedings of the fourth ACM conference on Recommender systems. ACM, 79--86. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Jingu Kim, Yunlong He, and Haesun Park. 2014. Algorithms for Nonnegative Matrix and Tensor Factorizations: A Unified View Based on Block Coordinate Descent Framework. Journal of Global Optimization 58, 2 (Feb. 2014), 285--319. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Jingu Kim and Haesun Park. 2011. Fast nonnegative matrix factorization: An active-set-like method and comparisons. SIAM Journal on Scientific Computing 33, 6 (2011), 3261--3281. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Tamara G Kolda and BrettWBader. 2009. Tensor decompositions and applications. SIAM review 51, 3 (2009), 455--500. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Tamara G Kolda and Dianne P O'Leary. 1998. A Semidiscrete Matrix Decomposition for Latent Semantic Indexing Information Retrieval. ACM Transactions on Information and System Security 16, 4 (Oct. 1998), 322--346. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Tamara G Kolda and Jimeng Sun. 2008. Scalable tensor decompositions for multiaspect data mining. In Data Mining, 2008. ICDM'08. Eighth IEEE International Conference on. IEEE, 363--372. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Mehmet Koyuturk, Ananth Grama, and Naren Ramakrishnan. 2005. Compression, clustering, and pattern discovery in very high-dimensional discrete-attribute data sets. IEEE Transactions on Knowledge and Data Engineering 17, 4 (2005), 447--461. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Mehmet Koyutürk, Ananth Grama, and Naren Ramakrishnan. 2006. Nonorthogonal decomposition of binary matrices for bounded-error data compression and analysis. ACM Transactions on Mathematical Software (TOMS) 32, 1 (2006), 33--69. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Joseph B Leader, Sarah A Pendergrass, Anurag Verma, David J Carey, Dustin N Hartzel, Marylyn D Ritchie, and H Lester Kirchner. 2015. Contrasting association results between existing PheWAS phenotype definition methods and five validated electronic phenotypes. In AMIA Annual Symposium Proceedings, Vol. 2015. American Medical Informatics Association, 824.Google ScholarGoogle Scholar
  35. D D Lee and H S Seung. 1999. Learning the parts of objects by non-negative matrix factorization. Nature 401, 6755 (Oct. 1999), 788--791.Google ScholarGoogle ScholarCross RefCross Ref
  36. Defu Lian, Rui Liu, Yong Ge, Kai Zheng, Xing Xie, and Longbing Cao. 2017. Discrete Content-aware Matrix Factorization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 325--334. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Pauli Miettinen. 2011. Boolean tensor factorizations. In Data Mining (ICDM), 2011 IEEE 11th International Conference on. IEEE, 447--456. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. P Miettinen, T Mielikäinen, A Gionis, G Das, and H Mannila. 2008. The Discrete Basis Problem. IEEE transactions on knowledge and data engineering 20, 10 (Oct. 2008), 1348--1362. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Gerhard Nahler. 2009. anatomical therapeutic chemical classification system (ATC). Springer Vienna, Vienna, 8--8.Google ScholarGoogle Scholar
  40. D O'Leary and S Peleg. 1983. Digital Image Compression by Outer Product Expansion. IEEE Transactions on Communications 31, 3 (March 1983), 441--444.Google ScholarGoogle ScholarCross RefCross Ref
  41. Ioakeim Perros, Robert Chen, Richard Vuduc, and Jimeng Sun. 2015. Sparse hierarchical tucker factorization and its application to healthcare. In Data Mining (ICDM), 2015 IEEE International Conference on. IEEE, 943--948. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Ioakeim Perros, Evangelos E Papalexakis, Fei Wang, Richard Vuduc, Elizabeth Searles, Michael Thompson, and Jimeng Sun. 2017. SPARTan: Scalable PARAFAC2 for Large &Sparse Data. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '17). ACM, 375--384. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. Rachel L Richesson, Jimeng Sun, Jyotishman Pathak, Abel N Kho, and Joshua C Denny. 2016. Clinical phenotyping in selected national networks: demonstrating the need for high-throughput, portable, and computational methods. Artificial intelligence in medicine 71 (July 2016), 57--61. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. Bao-Hong Shen, Shuiwang Ji, and Jieping Ye. 2009. Mining Discrete Patterns via Binary Matrix Factorization. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '09). ACM, New York, NY, USA, 757--766. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. Vergil N Slee. 1978. The International classification of diseases: ninth revision (ICD-9). Annals of internal medicine 88, 3 (1978), 424--426.Google ScholarGoogle ScholarCross RefCross Ref
  46. Age Smilde, Rasmus Bro, and Paul Geladi. 2005. Multi-way analysis: applications in the chemical sciences. John Wiley &Sons.Google ScholarGoogle Scholar
  47. Reza Takapoui, Nicholas Moehle, Stephen Boyd, and Alberto Bemporad. 2017. A simple effective heuristic for embedded mixed-integer quadratic programming. Internat. J. Control (2017), 1--11.Google ScholarGoogle Scholar
  48. Berk Ustun and Cynthia Rudin. 2017. Optimized Risk Scores. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1125--1134. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. X w. Chang and Q Han. 2008. Solving Box-Constrained Integer Least Squares Problems. IEEE Transactions on Wireless Communications 7, 1 (Jan. 2008), 277--287. Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. Yichen Wang, Robert Chen, Joydeep Ghosh, Joshua C Denny, Abel Kho, You Chen, Bradley A Malin, and Jimeng Sun. 2015. Rubik: Knowledge Guided Tensor Factorization and Completion for Health Data Analytics. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '15). ACM, New York, NY, USA, 1265--1274. Google ScholarGoogle ScholarDigital LibraryDigital Library
  51. Siqi Wu, Antony Joseph, Ann S Hammonds, Susan E Celniker, Bin Yu, and Erwin Frise. 2016. Stability-driven nonnegative matrix factorization to interpret spatial gene expression and build local gene networks. Proceedings of the National Academy of Sciences 113, 16 (2016), 4290--4295.Google ScholarGoogle ScholarCross RefCross Ref
  52. Pranjul Yadav, Michael Steinbach, Vipin Kumar, and Gyorgy Simon. 2018. Mining Electronic Health Records (EHRs): A Survey. ACM Computing Surveys (CSUR) 50, 6 (2018), 85. Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. Z Zhang, T Li, C Ding, and X Zhang. 2007. Binary Matrix Factorization with Applications. In Seventh IEEE International Conference on Data Mining (ICDM 2007). 391--400. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. SUSTain: Scalable Unsupervised Scoring for Tensors and its Application to Phenotyping

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Other conferences
        KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
        July 2018
        2925 pages
        ISBN:9781450355520
        DOI:10.1145/3219819

        Copyright © 2018 Owner/Author

        This work is licensed under a Creative Commons Attribution-ShareAlike International 4.0 License.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 19 July 2018

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article

        Acceptance Rates

        KDD '18 Paper Acceptance Rate107of983submissions,11%Overall Acceptance Rate1,133of8,635submissions,13%

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader