ABSTRACT
This paper presents a new method, which we call SUSTain, that extends real-valued matrix and tensor factorizations to data where values are integers. Such data are common when the values correspond to event counts or ordinal measures. The conventional approach is to treat integer data as real, and then apply real-valued factorizations. However, doing so fails to preserve important characteristics of the original data, thereby making it hard to interpret the results. Instead, our approach extracts factor values from integer datasets as scores that are constrained to take values from a small integer set. These scores are easy to interpret: a score of zero indicates no feature contribution and higher scores indicate distinct levels of feature importance. At its core, SUSTain relies on: a) a problem partitioning into integer-constrained subproblems, so that they can be optimally solved in an efficient manner; and b) organizing the order of the subproblems' solution, to promote reuse of shared intermediate results. We propose two variants, SUSTain_M and SUSTain_T, to handle both matrix and tensor inputs, respectively. We evaluate SUSTain against several state-of-the-art baselines on both synthetic and real Electronic Health Record (EHR) datasets. Comparing to those baselines, SUSTain shows either significantly better fit or orders of magnitude speedups that achieve a comparable fit (up to 425× faster). We apply SUSTain to EHR datasets to extract patient phenotypes (i.e., clinically meaningful patient clusters). Furthermore, 87% of them were validated as clinically meaningful phenotypes related to heart failure by a cardiologist.
Supplemental Material
- 2017. Clinical Classifications Software (CCS) for ICD-9-CM. https://www.hcup-us. ahrq.gov/toolssoftware/ccs/ccs.jsp. (2017). Accessed: 2017-02--11.Google Scholar
- Evrim Acar, Canan Aykut-Bingol, Haluk Bingol, Rasmus Bro, and Bülent Yener. 2007. Multiway analysis of epilepsy tensors. Bioinformatics 23, 13 (2007), i10--i18. Google ScholarDigital Library
- Brett W Bader and Tamara G Kolda. 2007. Efficient MATLAB computations with sparse and factored tensors. SIAM Journal on Scientific Computing 30, 1 (2007), 205--231. Google ScholarDigital Library
- Brett W. Bader, Tamara G. Kolda, and others. 2015. MATLAB Tensor Toolbox Version 2.6. Available online. (February 2015). http://www.sandia.gov/~tgkolda/ TensorToolbox/Google Scholar
- Stephen Boyd and Lieven Vandenberghe. 2004. Convex optimization. Cambridge university press. Google ScholarDigital Library
- Stephen Breen and Xiao-Wen Chang. 2012. Column Reordering for Box- Constrained Integer Least Squares Problems. (April 2012).Google Scholar
- John Brewer. 1978. Kronecker products and matrix calculus in system theory. IEEE Transactions on circuits and systems 25, 9 (1978), 772--781.Google ScholarCross Ref
- Rasmus Bro and Nicholaos D Sidiropoulos. 1998. Least squares algorithms under unimodality and non-negativity constraints. Journal of Chemometrics 12, 4 (1998), 223--247.Google ScholarCross Ref
- J Douglas Carroll and Jih-Jie Chang. 1970. Analysis of individual differences in multidimensional scaling via an N-way generalization of "Eckart-Young" decomposition. Psychometrika 35, 3 (1970), 283--319.Google ScholarCross Ref
- Xiao-Wen Chang and Tianyang Zhou. 2007. MILES: MATLAB package for solving mixed integer least squares problems. GPS Solutions 11, 4 (2007), 289--294. Last updated: June 2016.Google ScholarCross Ref
- Edward Choi, Andy Schuetz, Walter F Stewart, and Jimeng Sun. 2016. Using recurrent neural network models for early detection of heart failure onset. Journal of the American Medical Informatics Association 24, 2 (2016), 361--370.Google ScholarCross Ref
- Andrzej Cichocki and Anh-Huy Phan. 2009. Fast local algorithms for large scale nonnegative matrix and tensor factorizations. IEICE transactions on fundamentals of electronics, communications and computer sciences 92, 3 (2009), 708--721.Google Scholar
- Joshua C Denny, Marylyn D Ritchie, Melissa A Basford, Jill M Pulley, Lisa Bastarache, Kristin Brown-Gentry, Deede Wang, Dan R Masys, Dan M Roden, and Dana C Crawford. 2010. PheWAS: demonstrating the feasibility of a phenomewide scan to discover gene--disease associations. Bioinformatics 26, 9 (2010), 1205--1210. Google ScholarDigital Library
- Chris Ding, Tao Li, Wei Peng, and Haesun Park. 2006. Orthogonal nonnegative matrix t-factorizations for clustering. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 126--135. Google ScholarDigital Library
- Bo Dong, Matthew M Lin, and Haesun Park. 2017. Integer Matrix Approximation and Data Mining. Journal of scientific computing (Sept. 2017), 1--27. Google ScholarDigital Library
- Nicolas Gillis and others. 2011. Nonnegative matrix factorization: Complexity, algorithms and applications. Unpublished doctoral dissertation, Université catholique de Louvain. Louvain-La-Neuve: CORE (2011).Google Scholar
- Gene H Golub and Charles F Van Loan. 2013. Matrix Computations. Vol. 3. JHU Press.Google Scholar
- Richard A Harshman. 1970. Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multi-modal factor analysis. (1970).Google Scholar
- J Henderson, J C Ho, A N Kho, J C Denny, B A Malin, J Sun, and J Ghosh. 2017. Granite: Diversified, Sparse Tensor Factorization for Electronic Health Record-Based Phenotyping. In 2017 IEEE International Conference on Healthcare Informatics (ICHI). 214--223.Google Scholar
- Frank L Hitchcock. 1927. The expression of a tensor or a polyadic as a sum of products. Studies in Applied Mathematics 6, 1--4 (1927), 164--189.Google Scholar
- Joyce C Ho, Joydeep Ghosh, Steve R Steinhubl,Walter F Stewart, Joshua C Denny, Bradley A Malin, and Jimeng Sun. 2014. Limestone: high-throughput candidate phenotype generation via tensor factorization. Journal of biomedical informatics 52 (Dec. 2014), 199--211. Google ScholarDigital Library
- Joyce C Ho, Joydeep Ghosh, and Jimeng Sun. 2014. Marble: High-throughput Phenotyping from Electronic Health Records via Sparse Nonnegative Tensor Factorization. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '14). ACM, New York, NY, USA, 115--124. Google ScholarDigital Library
- Ngoc-Diep Ho. 2008. Nonnegative matrix factorization algorithms and applications. Ph.D. Dissertation. PhD thesis, Université catholique de Louvain.Google Scholar
- Shalmali Joshi, Suriya Gunasekar, David Sontag, and Ghosh Joydeep. 2016. Identifiable Phenotyping using Constrained Non-Negative Matrix Factorization. In Machine Learning for Healthcare Conference. 17--41.Google Scholar
- R. Kannan, G. Ballard, and H. Park. 2018. MPI-FAUN: An MPI-Based Framework for Alternating-Updating Nonnegative Matrix Factorization. IEEE Transactions on Knowledge and Data Engineering 30, 3 (March 2018), 544--558.Google ScholarCross Ref
- Alexandros Karatzoglou, Xavier Amatriain, Linas Baltrunas, and Nuria Oliver. 2010. Multiverse recommendation: n-dimensional tensor factorization for contextaware collaborative filtering. In Proceedings of the fourth ACM conference on Recommender systems. ACM, 79--86. Google ScholarDigital Library
- Jingu Kim, Yunlong He, and Haesun Park. 2014. Algorithms for Nonnegative Matrix and Tensor Factorizations: A Unified View Based on Block Coordinate Descent Framework. Journal of Global Optimization 58, 2 (Feb. 2014), 285--319. Google ScholarDigital Library
- Jingu Kim and Haesun Park. 2011. Fast nonnegative matrix factorization: An active-set-like method and comparisons. SIAM Journal on Scientific Computing 33, 6 (2011), 3261--3281. Google ScholarDigital Library
- Tamara G Kolda and BrettWBader. 2009. Tensor decompositions and applications. SIAM review 51, 3 (2009), 455--500. Google ScholarDigital Library
- Tamara G Kolda and Dianne P O'Leary. 1998. A Semidiscrete Matrix Decomposition for Latent Semantic Indexing Information Retrieval. ACM Transactions on Information and System Security 16, 4 (Oct. 1998), 322--346. Google ScholarDigital Library
- Tamara G Kolda and Jimeng Sun. 2008. Scalable tensor decompositions for multiaspect data mining. In Data Mining, 2008. ICDM'08. Eighth IEEE International Conference on. IEEE, 363--372. Google ScholarDigital Library
- Mehmet Koyuturk, Ananth Grama, and Naren Ramakrishnan. 2005. Compression, clustering, and pattern discovery in very high-dimensional discrete-attribute data sets. IEEE Transactions on Knowledge and Data Engineering 17, 4 (2005), 447--461. Google ScholarDigital Library
- Mehmet Koyutürk, Ananth Grama, and Naren Ramakrishnan. 2006. Nonorthogonal decomposition of binary matrices for bounded-error data compression and analysis. ACM Transactions on Mathematical Software (TOMS) 32, 1 (2006), 33--69. Google ScholarDigital Library
- Joseph B Leader, Sarah A Pendergrass, Anurag Verma, David J Carey, Dustin N Hartzel, Marylyn D Ritchie, and H Lester Kirchner. 2015. Contrasting association results between existing PheWAS phenotype definition methods and five validated electronic phenotypes. In AMIA Annual Symposium Proceedings, Vol. 2015. American Medical Informatics Association, 824.Google Scholar
- D D Lee and H S Seung. 1999. Learning the parts of objects by non-negative matrix factorization. Nature 401, 6755 (Oct. 1999), 788--791.Google ScholarCross Ref
- Defu Lian, Rui Liu, Yong Ge, Kai Zheng, Xing Xie, and Longbing Cao. 2017. Discrete Content-aware Matrix Factorization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 325--334. Google ScholarDigital Library
- Pauli Miettinen. 2011. Boolean tensor factorizations. In Data Mining (ICDM), 2011 IEEE 11th International Conference on. IEEE, 447--456. Google ScholarDigital Library
- P Miettinen, T Mielikäinen, A Gionis, G Das, and H Mannila. 2008. The Discrete Basis Problem. IEEE transactions on knowledge and data engineering 20, 10 (Oct. 2008), 1348--1362. Google ScholarDigital Library
- Gerhard Nahler. 2009. anatomical therapeutic chemical classification system (ATC). Springer Vienna, Vienna, 8--8.Google Scholar
- D O'Leary and S Peleg. 1983. Digital Image Compression by Outer Product Expansion. IEEE Transactions on Communications 31, 3 (March 1983), 441--444.Google ScholarCross Ref
- Ioakeim Perros, Robert Chen, Richard Vuduc, and Jimeng Sun. 2015. Sparse hierarchical tucker factorization and its application to healthcare. In Data Mining (ICDM), 2015 IEEE International Conference on. IEEE, 943--948. Google ScholarDigital Library
- Ioakeim Perros, Evangelos E Papalexakis, Fei Wang, Richard Vuduc, Elizabeth Searles, Michael Thompson, and Jimeng Sun. 2017. SPARTan: Scalable PARAFAC2 for Large &Sparse Data. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '17). ACM, 375--384. Google ScholarDigital Library
- Rachel L Richesson, Jimeng Sun, Jyotishman Pathak, Abel N Kho, and Joshua C Denny. 2016. Clinical phenotyping in selected national networks: demonstrating the need for high-throughput, portable, and computational methods. Artificial intelligence in medicine 71 (July 2016), 57--61. Google ScholarDigital Library
- Bao-Hong Shen, Shuiwang Ji, and Jieping Ye. 2009. Mining Discrete Patterns via Binary Matrix Factorization. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '09). ACM, New York, NY, USA, 757--766. Google ScholarDigital Library
- Vergil N Slee. 1978. The International classification of diseases: ninth revision (ICD-9). Annals of internal medicine 88, 3 (1978), 424--426.Google ScholarCross Ref
- Age Smilde, Rasmus Bro, and Paul Geladi. 2005. Multi-way analysis: applications in the chemical sciences. John Wiley &Sons.Google Scholar
- Reza Takapoui, Nicholas Moehle, Stephen Boyd, and Alberto Bemporad. 2017. A simple effective heuristic for embedded mixed-integer quadratic programming. Internat. J. Control (2017), 1--11.Google Scholar
- Berk Ustun and Cynthia Rudin. 2017. Optimized Risk Scores. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1125--1134. Google ScholarDigital Library
- X w. Chang and Q Han. 2008. Solving Box-Constrained Integer Least Squares Problems. IEEE Transactions on Wireless Communications 7, 1 (Jan. 2008), 277--287. Google ScholarDigital Library
- Yichen Wang, Robert Chen, Joydeep Ghosh, Joshua C Denny, Abel Kho, You Chen, Bradley A Malin, and Jimeng Sun. 2015. Rubik: Knowledge Guided Tensor Factorization and Completion for Health Data Analytics. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '15). ACM, New York, NY, USA, 1265--1274. Google ScholarDigital Library
- Siqi Wu, Antony Joseph, Ann S Hammonds, Susan E Celniker, Bin Yu, and Erwin Frise. 2016. Stability-driven nonnegative matrix factorization to interpret spatial gene expression and build local gene networks. Proceedings of the National Academy of Sciences 113, 16 (2016), 4290--4295.Google ScholarCross Ref
- Pranjul Yadav, Michael Steinbach, Vipin Kumar, and Gyorgy Simon. 2018. Mining Electronic Health Records (EHRs): A Survey. ACM Computing Surveys (CSUR) 50, 6 (2018), 85. Google ScholarDigital Library
- Z Zhang, T Li, C Ding, and X Zhang. 2007. Binary Matrix Factorization with Applications. In Seventh IEEE International Conference on Data Mining (ICDM 2007). 391--400. Google ScholarDigital Library
Index Terms
- SUSTain: Scalable Unsupervised Scoring for Tensors and its Application to Phenotyping
Recommendations
SPARTan: Scalable PARAFAC2 for Large & Sparse Data
KDD '17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data MiningIn exploratory tensor mining, a common problem is how to analyze a set of variables across a set of subjects whose observations do not align naturally. For example, when modeling medical features across a set of patients, the number and duration of ...
COPA: Constrained PARAFAC2 for Sparse & Large Datasets
CIKM '18: Proceedings of the 27th ACM International Conference on Information and Knowledge ManagementPARAFAC2 has demonstrated success in modeling irregular tensors, where the tensor dimensions vary across one of the modes. An example scenario is modeling treatments across a set of patients with the varying number of medical encounters over time. ...
Boolean Tensor Factorizations
ICDM '11: Proceedings of the 2011 IEEE 11th International Conference on Data MiningTensors are multi-way generalizations of matrices, and similarly to matrices, they can also be factorized, that is, represented (approximately) as a product of factors. These factors are typically either all matrices or a mixture of matrices and ...
Comments