skip to main content
10.1145/1772690.1772760acmotherconferencesArticle/Chapter ViewAbstractPublication PagesthewebconfConference Proceedingsconference-collections
research-article

Distributed nonnegative matrix factorization for web-scale dyadic data analysis on mapreduce

Published: 26 April 2010 Publication History

Abstract

The Web abounds with dyadic data that keeps increasing by every single second. Previous work has repeatedly shown the usefulness of extracting the interaction structure inside dyadic data [21, 9, 8]. A commonly used tool in extracting the underlying structure is the matrix factorization, whose fame was further boosted in the Netflix challenge [26]. When we were trying to replicate the same success on real-world Web dyadic data, we were seriously challenged by the scalability of available tools. We therefore in this paper report our efforts on scaling up the nonnegative matrix factorization (NMF) technique. We show that by carefully partitioning the data and arranging the computations to maximize data locality and parallelism, factorizing a tens of millions by hundreds of millions matrix with billions of nonzero cells can be accomplished within tens of hours. This result effectively assures practitioners of the scalability of NMF on Web-scale dyadic data.

References

[1]
D. Agarwal and S. Merugu. Predictive discrete latent factor models for large scale dyadic data. In KDD, 2007.
[2]
E. Agichtein, E. Brill, and S. Dumais. Improving web search ranking by incorporating user behavior information. In SIGIR, pages 19--26, 2006.
[3]
R. Baeza-Yates, C. Hurtado, and M. Mendoza. Query recommendation using query logs in search engines. EDBT Workshops, pages 588--596, 2004.
[4]
M. W. Berry, M. Browne, A. N. Langville, P. V. Pauca, and R. J. Plemmons. Algorithms and applications for approximate nonnegative matrix factorization. Computational Statistics & Data Analysis, 52(1):155--173, September 2007.
[5]
E. Y. Chang, K. Zhu, and H. Bai. Parallel algorithms for mining large-scale datasets. In CIKM, 2009.
[6]
E. Y. Chang, K. Zhu, H. Wang, H. Bai, J. Li, Z. Qiu, and H. Cui. PSVM: Parallelizing support vector machines on distributed computers. In NIPS, 2007.
[7]
D. Chen. Efficient geometric algorithms on the EREW PRAM. IEEE Trans. Parallel Distrib. Syst., 1995.
[8]
W.-Y. Chen, J.-C. Chu, J. Luan, H. Bai, Y. Wang, and E. Y. Chang. Collaborative filtering for orkut communities: discovery of user latent behavior. In WWW, 2009.
[9]
Y. Chen, D. Pavlov, and J. F. Canny. Large-scale behavioral targeting. In KDD, pages 209--218, 2009.
[10]
C. T. Chu, S. K. Kim, Y. A. Lin, Y. Yu, G. R. Bradski, A. Y. Ng, and K. Olukotun. Map-reduce for machine learning on multicore. In NIPS, 2006.
[11]
A. S. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: scalable online collaborative filtering. In WWW, pages 271--280, 2007.
[12]
W. L. C. David A. Kenny, Deborah A. Kashy. Query clustering using user logs. ACM Trans. Inf. Syst., 20(1):59--81, 2002.
[13]
W. L. C. David A. Kenny, Deborah A. Kashy. Dyadic Data Analysis. The Guilford Press, 2006.
[14]
J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, 2004.
[15]
S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41:391--407, 1990.
[16]
I. S. Dhillon. Co-clustering documents and words using bipartite spectral graph partitioning. In KDD, pages 269--274, 2001.
[17]
C. Ding, T. Li, W. Peng, and H. Park. Orthogonal nonnegative matrix tri-factorizations for clustering. In KDD, pages 126--135, 2006.
[18]
Q. Dong, X. Wang, and L. Lin. Application of latent semantic analysis to protein remote homology detection. Bioinformatics, 22(3):285--290, 2005.
[19]
S. Filippone and M. Colajanni. PSBLAS: a library for parallel linear algebra computation on sparse matrices. ACM Trans. Math. Softw., 26(4):527--550, 2000.
[20]
D. Hanisch, A. Zien, R. Zimmer, and T. Lengauer. Co-clustering of biological networks and gene expression data. Bioinformatics, 18(1):145--154, 2002.
[21]
T. Hofmann, J. Puzicha, and M. I. Jordan. Learning from dyadic data. In NIPS, pages 466--472, 1999.
[22]
P. O. Hoyer. Non-negative matrix factorization with sparseness constraints. J. Mach. Learn. Res., 5, 2004.
[23]
K. Kanjani. Parallel non negative matrix factorization for document clustering. Technical report, Texas A & M University, May 2007.
[24]
D. Kim, S. Sra, and I. S. Dhillon. Fast newton-type methods for the least squares nonnegative matrix approximation problem. In SDM, 2007.
[25]
D. Kim, S. Sra, and I. S. Dhillon. Fast projection based methods for the least squares nonnegative matrix approximation problem. Statistical Analysis and Data Mining, 1(1), 2008.
[26]
Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30--37, 2009.
[27]
W. Kowalczyk and N. Vlassis. Newscast em. In In NIPS 17, pages 713--720. MIT Press, 2005.
[28]
D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788--791, 1999.
[29]
D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In NIPS, 2000.
[30]
S. Li, X. Hou, H. Zhang, and Q. Cheng. Learning spatially localized, parts-based representation. CVPR, 1:207, 2001.
[31]
X. Li, C. Snoek, and M. Worring. Learning tag relevance by neighbor voting for social image retrieval. In Proc. of the 1st ACM international conference on Multimedia information retrieval (MIR '08), pages 180--187, 2008.
[32]
Y. Liu, B. Gao, T.-Y. Liu, Y. Zhang, Z. Ma, S. He, and H. Li. BrowseRank: letting web users vote for page importance. In SIGIR, pages 451--458, 2008.
[33]
R. Nallapati, W. Cohen, and J. Lafferty. Parallelized variational em for latent dirichlet allocation: An experimental evaluation of speed and scalability. Proc. of the 7th International Conference on Data Mining Workshops (ICDMW'07), pages 349--354, 2007.
[34]
P. Pacheco. Parallel Programming with MPI. Morgan Kaufmann, 1997.
[35]
J. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In ICML, pages 713--719, 2005.
[36]
S. A. Robila and L. G. Maciak. A parallel unmixing algorithm for hyperspectral images. In Intelligent Robots and Computer Vision XXIV, 2006.
[37]
M. N. Schmidt, O. Winther, and L. K. Hansen. Bayesian non-negative matrix factorization. In Independent Component Analysis and Signal Separation, pages 540--547, 2009.
[38]
F. Shahnaz, M. W. Berry, V. P. Pauca, and R. J. Plemmons. Document clustering using nonnegative matrix factorization. Inf. Process. Manage., 42(2), 2006.
[39]
A. P. Singh and G. J. Gordon. A unified view of matrix factorization models. In PKDD, pages 358--373, 2008.
[40]
S. Sra and I. S. Dhillon. Nonnegative matrix approximation: Algorithms and applications. Technical report, CS Department, University of Texas, June 2006.
[41]
Y. Wang, H. Bai, M. Stanton, W.-Y. Chen, and E. Y. Chang. Plda: Parallel latent dirichlet allocation for large-scale applications. In ICAAIM, 2009.
[42]
J. Wolfe, A. Haghighi, and D. Klein. Fully distributed em for very large datasets. In ICML, 2008.
[43]
W. Xu, X. Liu, and Y. Gong. Document clustering based on non-negative matrix factorization. In SIGIR, pages 267--273, 2003.
[44]
K. Yu, S. Zhu, J. Lafferty, and Y. Gong. Fast nonparametric matrix factorization for large-scale collaborative filtering. In SIGIR, pages 211--218, 2009.

Cited By

View all
  • (2024)Clustering Network Traffic Using Semi-Supervised LearningElectronics10.3390/electronics1314276913:14(2769)Online publication date: 14-Jul-2024
  • (2024)A Framework for Compressed Weighted Nonnegative Matrix FactorizationIEEE Transactions on Signal Processing10.1109/TSP.2024.346983072(4798-4811)Online publication date: 1-Jan-2024
  • (2024)Consensus Subspace Graph Regularization based on prior information for multiplex network clusteringEngineering Applications of Artificial Intelligence10.1016/j.engappai.2024.108851135(108851)Online publication date: Sep-2024
  • Show More Cited By

Index Terms

  1. Distributed nonnegative matrix factorization for web-scale dyadic data analysis on mapreduce

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM Other conferences
    WWW '10: Proceedings of the 19th international conference on World wide web
    April 2010
    1407 pages
    ISBN:9781605587998
    DOI:10.1145/1772690

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 26 April 2010

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. distributed computing
    2. dyadic data
    3. mapreduce
    4. nonnegative matrix factorization

    Qualifiers

    • Research-article

    Conference

    WWW '10
    WWW '10: The 19th International World Wide Web Conference
    April 26 - 30, 2010
    North Carolina, Raleigh, USA

    Acceptance Rates

    Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)47
    • Downloads (Last 6 weeks)3
    Reflects downloads up to 20 Feb 2025

    Other Metrics

    Citations

    Cited By

    View all
    • (2024)Clustering Network Traffic Using Semi-Supervised LearningElectronics10.3390/electronics1314276913:14(2769)Online publication date: 14-Jul-2024
    • (2024)A Framework for Compressed Weighted Nonnegative Matrix FactorizationIEEE Transactions on Signal Processing10.1109/TSP.2024.346983072(4798-4811)Online publication date: 1-Jan-2024
    • (2024)Consensus Subspace Graph Regularization based on prior information for multiplex network clusteringEngineering Applications of Artificial Intelligence10.1016/j.engappai.2024.108851135(108851)Online publication date: Sep-2024
    • (2024)Semi-supervised nonnegative matrix factorization with label propagation and constraint propagationEngineering Applications of Artificial Intelligence10.1016/j.engappai.2024.108196133:PCOnline publication date: 1-Jul-2024
    • (2023)FedRule: Federated Rule Recommendation System with Graph Neural NetworksProceedings of the 8th ACM/IEEE Conference on Internet of Things Design and Implementation10.1145/3576842.3582328(197-208)Online publication date: 9-May-2023
    • (2023)FINISH: Efficient and Scalable NMF-Based Federated Learning for Detecting Malware ActivitiesIEEE Transactions on Emerging Topics in Computing10.1109/TETC.2023.329292411:4(934-949)Online publication date: Oct-2023
    • (2022)Robust Discriminative Non-Negative Matrix Factorization with Maximum Correntropy CriterionArtificial Intelligence and Applications10.5121/csit.2022.121804(37-49)Online publication date: 29-Oct-2022
    • (2022)Federated Matrix Factorization: Algorithm Design and Application to Data ClusteringIEEE Transactions on Signal Processing10.1109/TSP.2022.315150570(1625-1640)Online publication date: 2022
    • (2022)Distributed Bayesian Matrix Decomposition for Big Data Mining and ClusteringIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2020.302958234:8(3701-3713)Online publication date: 1-Aug-2022
    • (2022)Fast and Secure Distributed Nonnegative Matrix FactorizationIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2020.298596434:2(653-666)Online publication date: 1-Feb-2022
    • Show More Cited By

    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    EPUB

    View this article in ePub.

    ePub

    Figures

    Tables

    Media

    Share

    Share

    Share this Publication link

    Share on social media