research-article

Public Access

SUSTain: Scalable Unsupervised Scoring for Tensors and its Application to Phenotyping

Authors:
Ioakeim Perros

Georgia Institute of Technology, Atlanta, GA, USA

Georgia Institute of Technology, Atlanta, GA, USA
View Profile

,
Evangelos E. Papalexakis

University of California, Riverside, Riverside, CA, USA

University of California, Riverside, Riverside, CA, USA
View Profile

,
Haesun Park

Georgia Institute of Technology, Atlanta, GA, USA

Georgia Institute of Technology, Atlanta, GA, USA
View Profile

,
Richard Vuduc

Georgia Institute of Technology, Atlanta, GA, USA

Georgia Institute of Technology, Atlanta, GA, USA
View Profile

,
Xiaowei Yan

Sutter Health, Walnut Creek, CA, USA

Sutter Health, Walnut Creek, CA, USA
View Profile

,
Christopher Defilippi

Inova Heart and Vascular Institute, West Falls Church, VA, USA

Inova Heart and Vascular Institute, West Falls Church, VA, USA
View Profile

,
Walter F. Stewart

Sutter Health, Walnut Creek, CA, USA

Sutter Health, Walnut Creek, CA, USA
View Profile

,
Jimeng Sun

Georgia Institute of Technology, Atlanta, GA, USA

Georgia Institute of Technology, Atlanta, GA, USA
View Profile

KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data MiningJuly 2018Pages 2080–2089https://doi.org/10.1145/3219819.3219999

Published:19 July 2018Publication History

KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

Pages 2080–2089

ABSTRACT

This paper presents a new method, which we call SUSTain, that extends real-valued matrix and tensor factorizations to data where values are integers. Such data are common when the values correspond to event counts or ordinal measures. The conventional approach is to treat integer data as real, and then apply real-valued factorizations. However, doing so fails to preserve important characteristics of the original data, thereby making it hard to interpret the results. Instead, our approach extracts factor values from integer datasets as scores that are constrained to take values from a small integer set. These scores are easy to interpret: a score of zero indicates no feature contribution and higher scores indicate distinct levels of feature importance. At its core, SUSTain relies on: a) a problem partitioning into integer-constrained subproblems, so that they can be optimally solved in an efficient manner; and b) organizing the order of the subproblems' solution, to promote reuse of shared intermediate results. We propose two variants, SUSTain_M and SUSTain_T, to handle both matrix and tensor inputs, respectively. We evaluate SUSTain against several state-of-the-art baselines on both synthetic and real Electronic Health Record (EHR) datasets. Comparing to those baselines, SUSTain shows either significantly better fit or orders of magnitude speedups that achieve a comparable fit (up to 425× faster). We apply SUSTain to EHR datasets to extract patient phenotypes (i.e., clinically meaningful patient clusters). Furthermore, 87% of them were validated as clinically meaningful phenotypes related to heart failure by a cardiologist.

Supplemental Material

perros_sustain_unsupervised.mp4

mp4

353.4 MB

Download

References

2017. Clinical Classifications Software (CCS) for ICD-9-CM. https://www.hcup-us. ahrq.gov/toolssoftware/ccs/ccs.jsp. (2017). Accessed: 2017-02--11.Google Scholar
Evrim Acar, Canan Aykut-Bingol, Haluk Bingol, Rasmus Bro, and Bülent Yener. 2007. Multiway analysis of epilepsy tensors. Bioinformatics 23, 13 (2007), i10--i18. Google ScholarDigital Library
Brett W Bader and Tamara G Kolda. 2007. Efficient MATLAB computations with sparse and factored tensors. SIAM Journal on Scientific Computing 30, 1 (2007), 205--231. Google ScholarDigital Library
Brett W. Bader, Tamara G. Kolda, and others. 2015. MATLAB Tensor Toolbox Version 2.6. Available online. (February 2015). http://www.sandia.gov/~tgkolda/ TensorToolbox/Google Scholar
Stephen Boyd and Lieven Vandenberghe. 2004. Convex optimization. Cambridge university press. Google ScholarDigital Library
Stephen Breen and Xiao-Wen Chang. 2012. Column Reordering for Box- Constrained Integer Least Squares Problems. (April 2012).Google Scholar
John Brewer. 1978. Kronecker products and matrix calculus in system theory. IEEE Transactions on circuits and systems 25, 9 (1978), 772--781.Google ScholarCross Ref
Rasmus Bro and Nicholaos D Sidiropoulos. 1998. Least squares algorithms under unimodality and non-negativity constraints. Journal of Chemometrics 12, 4 (1998), 223--247.Google ScholarCross Ref
J Douglas Carroll and Jih-Jie Chang. 1970. Analysis of individual differences in multidimensional scaling via an N-way generalization of "Eckart-Young" decomposition. Psychometrika 35, 3 (1970), 283--319.Google ScholarCross Ref
Xiao-Wen Chang and Tianyang Zhou. 2007. MILES: MATLAB package for solving mixed integer least squares problems. GPS Solutions 11, 4 (2007), 289--294. Last updated: June 2016.Google ScholarCross Ref
Edward Choi, Andy Schuetz, Walter F Stewart, and Jimeng Sun. 2016. Using recurrent neural network models for early detection of heart failure onset. Journal of the American Medical Informatics Association 24, 2 (2016), 361--370.Google ScholarCross Ref
Andrzej Cichocki and Anh-Huy Phan. 2009. Fast local algorithms for large scale nonnegative matrix and tensor factorizations. IEICE transactions on fundamentals of electronics, communications and computer sciences 92, 3 (2009), 708--721.Google Scholar
Joshua C Denny, Marylyn D Ritchie, Melissa A Basford, Jill M Pulley, Lisa Bastarache, Kristin Brown-Gentry, Deede Wang, Dan R Masys, Dan M Roden, and Dana C Crawford. 2010. PheWAS: demonstrating the feasibility of a phenomewide scan to discover gene--disease associations. Bioinformatics 26, 9 (2010), 1205--1210. Google ScholarDigital Library
Chris Ding, Tao Li, Wei Peng, and Haesun Park. 2006. Orthogonal nonnegative matrix t-factorizations for clustering. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 126--135. Google ScholarDigital Library
Bo Dong, Matthew M Lin, and Haesun Park. 2017. Integer Matrix Approximation and Data Mining. Journal of scientific computing (Sept. 2017), 1--27. Google ScholarDigital Library
Nicolas Gillis and others. 2011. Nonnegative matrix factorization: Complexity, algorithms and applications. Unpublished doctoral dissertation, Université catholique de Louvain. Louvain-La-Neuve: CORE (2011).Google Scholar
Gene H Golub and Charles F Van Loan. 2013. Matrix Computations. Vol. 3. JHU Press.Google Scholar
Richard A Harshman. 1970. Foundations of the PARAFAC procedure: Models and conditions for an "explanatory" multi-modal factor analysis. (1970).Google Scholar
J Henderson, J C Ho, A N Kho, J C Denny, B A Malin, J Sun, and J Ghosh. 2017. Granite: Diversified, Sparse Tensor Factorization for Electronic Health Record-Based Phenotyping. In 2017 IEEE International Conference on Healthcare Informatics (ICHI). 214--223.Google Scholar
Frank L Hitchcock. 1927. The expression of a tensor or a polyadic as a sum of products. Studies in Applied Mathematics 6, 1--4 (1927), 164--189.Google Scholar
Joyce C Ho, Joydeep Ghosh, Steve R Steinhubl,Walter F Stewart, Joshua C Denny, Bradley A Malin, and Jimeng Sun. 2014. Limestone: high-throughput candidate phenotype generation via tensor factorization. Journal of biomedical informatics 52 (Dec. 2014), 199--211. Google ScholarDigital Library
Joyce C Ho, Joydeep Ghosh, and Jimeng Sun. 2014. Marble: High-throughput Phenotyping from Electronic Health Records via Sparse Nonnegative Tensor Factorization. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '14). ACM, New York, NY, USA, 115--124. Google ScholarDigital Library
Ngoc-Diep Ho. 2008. Nonnegative matrix factorization algorithms and applications. Ph.D. Dissertation. PhD thesis, Université catholique de Louvain.Google Scholar
Shalmali Joshi, Suriya Gunasekar, David Sontag, and Ghosh Joydeep. 2016. Identifiable Phenotyping using Constrained Non-Negative Matrix Factorization. In Machine Learning for Healthcare Conference. 17--41.Google Scholar
R. Kannan, G. Ballard, and H. Park. 2018. MPI-FAUN: An MPI-Based Framework for Alternating-Updating Nonnegative Matrix Factorization. IEEE Transactions on Knowledge and Data Engineering 30, 3 (March 2018), 544--558.Google ScholarCross Ref
Alexandros Karatzoglou, Xavier Amatriain, Linas Baltrunas, and Nuria Oliver. 2010. Multiverse recommendation: n-dimensional tensor factorization for contextaware collaborative filtering. In Proceedings of the fourth ACM conference on Recommender systems. ACM, 79--86. Google ScholarDigital Library
Jingu Kim, Yunlong He, and Haesun Park. 2014. Algorithms for Nonnegative Matrix and Tensor Factorizations: A Unified View Based on Block Coordinate Descent Framework. Journal of Global Optimization 58, 2 (Feb. 2014), 285--319. Google ScholarDigital Library
Jingu Kim and Haesun Park. 2011. Fast nonnegative matrix factorization: An active-set-like method and comparisons. SIAM Journal on Scientific Computing 33, 6 (2011), 3261--3281. Google ScholarDigital Library
Tamara G Kolda and BrettWBader. 2009. Tensor decompositions and applications. SIAM review 51, 3 (2009), 455--500. Google ScholarDigital Library
Tamara G Kolda and Dianne P O'Leary. 1998. A Semidiscrete Matrix Decomposition for Latent Semantic Indexing Information Retrieval. ACM Transactions on Information and System Security 16, 4 (Oct. 1998), 322--346. Google ScholarDigital Library
Tamara G Kolda and Jimeng Sun. 2008. Scalable tensor decompositions for multiaspect data mining. In Data Mining, 2008. ICDM'08. Eighth IEEE International Conference on. IEEE, 363--372. Google ScholarDigital Library
Mehmet Koyuturk, Ananth Grama, and Naren Ramakrishnan. 2005. Compression, clustering, and pattern discovery in very high-dimensional discrete-attribute data sets. IEEE Transactions on Knowledge and Data Engineering 17, 4 (2005), 447--461. Google ScholarDigital Library
Mehmet Koyutürk, Ananth Grama, and Naren Ramakrishnan. 2006. Nonorthogonal decomposition of binary matrices for bounded-error data compression and analysis. ACM Transactions on Mathematical Software (TOMS) 32, 1 (2006), 33--69. Google ScholarDigital Library
Joseph B Leader, Sarah A Pendergrass, Anurag Verma, David J Carey, Dustin N Hartzel, Marylyn D Ritchie, and H Lester Kirchner. 2015. Contrasting association results between existing PheWAS phenotype definition methods and five validated electronic phenotypes. In AMIA Annual Symposium Proceedings, Vol. 2015. American Medical Informatics Association, 824.Google Scholar
D D Lee and H S Seung. 1999. Learning the parts of objects by non-negative matrix factorization. Nature 401, 6755 (Oct. 1999), 788--791.Google ScholarCross Ref
Defu Lian, Rui Liu, Yong Ge, Kai Zheng, Xing Xie, and Longbing Cao. 2017. Discrete Content-aware Matrix Factorization. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 325--334. Google ScholarDigital Library
Pauli Miettinen. 2011. Boolean tensor factorizations. In Data Mining (ICDM), 2011 IEEE 11th International Conference on. IEEE, 447--456. Google ScholarDigital Library
P Miettinen, T Mielikäinen, A Gionis, G Das, and H Mannila. 2008. The Discrete Basis Problem. IEEE transactions on knowledge and data engineering 20, 10 (Oct. 2008), 1348--1362. Google ScholarDigital Library
Gerhard Nahler. 2009. anatomical therapeutic chemical classification system (ATC). Springer Vienna, Vienna, 8--8.Google Scholar
D O'Leary and S Peleg. 1983. Digital Image Compression by Outer Product Expansion. IEEE Transactions on Communications 31, 3 (March 1983), 441--444.Google ScholarCross Ref
Ioakeim Perros, Robert Chen, Richard Vuduc, and Jimeng Sun. 2015. Sparse hierarchical tucker factorization and its application to healthcare. In Data Mining (ICDM), 2015 IEEE International Conference on. IEEE, 943--948. Google ScholarDigital Library
Ioakeim Perros, Evangelos E Papalexakis, Fei Wang, Richard Vuduc, Elizabeth Searles, Michael Thompson, and Jimeng Sun. 2017. SPARTan: Scalable PARAFAC2 for Large &Sparse Data. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '17). ACM, 375--384. Google ScholarDigital Library
Rachel L Richesson, Jimeng Sun, Jyotishman Pathak, Abel N Kho, and Joshua C Denny. 2016. Clinical phenotyping in selected national networks: demonstrating the need for high-throughput, portable, and computational methods. Artificial intelligence in medicine 71 (July 2016), 57--61. Google ScholarDigital Library
Bao-Hong Shen, Shuiwang Ji, and Jieping Ye. 2009. Mining Discrete Patterns via Binary Matrix Factorization. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '09). ACM, New York, NY, USA, 757--766. Google ScholarDigital Library
Vergil N Slee. 1978. The International classification of diseases: ninth revision (ICD-9). Annals of internal medicine 88, 3 (1978), 424--426.Google ScholarCross Ref
Age Smilde, Rasmus Bro, and Paul Geladi. 2005. Multi-way analysis: applications in the chemical sciences. John Wiley &Sons.Google Scholar
Reza Takapoui, Nicholas Moehle, Stephen Boyd, and Alberto Bemporad. 2017. A simple effective heuristic for embedded mixed-integer quadratic programming. Internat. J. Control (2017), 1--11.Google Scholar
Berk Ustun and Cynthia Rudin. 2017. Optimized Risk Scores. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1125--1134. Google ScholarDigital Library
X w. Chang and Q Han. 2008. Solving Box-Constrained Integer Least Squares Problems. IEEE Transactions on Wireless Communications 7, 1 (Jan. 2008), 277--287. Google ScholarDigital Library
Yichen Wang, Robert Chen, Joydeep Ghosh, Joshua C Denny, Abel Kho, You Chen, Bradley A Malin, and Jimeng Sun. 2015. Rubik: Knowledge Guided Tensor Factorization and Completion for Health Data Analytics. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '15). ACM, New York, NY, USA, 1265--1274. Google ScholarDigital Library
Siqi Wu, Antony Joseph, Ann S Hammonds, Susan E Celniker, Bin Yu, and Erwin Frise. 2016. Stability-driven nonnegative matrix factorization to interpret spatial gene expression and build local gene networks. Proceedings of the National Academy of Sciences 113, 16 (2016), 4290--4295.Google ScholarCross Ref
Pranjul Yadav, Michael Steinbach, Vipin Kumar, and Gyorgy Simon. 2018. Mining Electronic Health Records (EHRs): A Survey. ACM Computing Surveys (CSUR) 50, 6 (2018), 85. Google ScholarDigital Library
Z Zhang, T Li, C Ding, and X Zhang. 2007. Binary Matrix Factorization with Applications. In Seventh IEEE International Conference on Data Mining (ICDM 2007). 391--400. Google ScholarDigital Library

Index Terms

SUSTain: Scalable Unsupervised Scoring for Tensors and its Application to Phenotyping
1. Applied computing
  1. Life and medical sciences
    1. Health informatics
2. Computing methodologies
  1. Machine learning
    1. Machine learning approaches
      1. Factorization methods

Recommendations

SPARTan: Scalable PARAFAC2 for Large & Sparse Data
KDD '17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

In exploratory tensor mining, a common problem is how to analyze a set of variables across a set of subjects whose observations do not align naturally. For example, when modeling medical features across a set of patients, the number and duration of ...
Read More
COPA: Constrained PARAFAC2 for Sparse & Large Datasets
CIKM '18: Proceedings of the 27th ACM International Conference on Information and Knowledge Management

PARAFAC2 has demonstrated success in modeling irregular tensors, where the tensor dimensions vary across one of the modes. An example scenario is modeling treatments across a set of patients with the varying number of medical encounters over time. ...
Read More
Boolean Tensor Factorizations
ICDM '11: Proceedings of the 2011 IEEE 11th International Conference on Data Mining

Tensors are multi-way generalizations of matrices, and similarly to matrices, they can also be factorized, that is, represented (approximately) as a product of factors. These factors are typically either all matrices or a mixture of matrices and ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
July 2018
2925 pages
ISBN:9781450355520
DOI:10.1145/3219819
General Chairs:
Yike Guo
Imperial College London
,
Faisal Farooq
IBM
Copyright © 2018 Owner/Author
This work is licensed under a Creative Commons Attribution-ShareAlike International 4.0 License.
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 19 July 2018
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
matrix factorization
phenotyping
tensor factorization
unsupervised learning
Qualifiers
- research-article
Conference

Acceptance Rates
KDD '18 Paper Acceptance Rate107of983submissions,11%Overall Acceptance Rate1,133of8,635submissions,13%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 7
  Total Citations
  View Citations
- 798
  Total Downloads
- Downloads (Last 12 months)77
- Downloads (Last 6 weeks)8
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

SUSTain: Scalable Unsupervised Scoring for Tensors and its Application to Phenotyping

KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

ABSTRACT

Supplemental Material

References

Cited By

Index Terms

Recommendations

SPARTan: Scalable PARAFAC2 for Large & Sparse Data

COPA: Constrained PARAFAC2 for Sparse & Large Datasets

Boolean Tensor Factorizations

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

SUSTain: Scalable Unsupervised Scoring for Tensors and its Application to Phenotyping

KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

ABSTRACT

Supplemental Material

References

Cited By

Index Terms

Recommendations

SPARTan: Scalable PARAFAC2 for Large & Sparse Data

COPA: Constrained PARAFAC2 for Sparse & Large Datasets

Boolean Tensor Factorizations

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media