Distributed Algorithms for Topic Models

Published: 1 December 2009

Abstract

We describe distributed algorithms for two widely used topic models, namely the Latent Dirichlet Allocation (LDA) model and the Hierarchical Dirichlet Process (HDP) model. In our distributed algorithms, the data is partitioned across separate processors and inference is done in a parallel, distributed fashion. We propose two distributed algorithms for LDA. The first algorithm is a straightforward mapping of LDA to a distributed processor setting. In this algorithm, processors concurrently perform Gibbs sampling over local data followed by a global update of topic counts. The algorithm is simple to implement and can be viewed as an approximation to Gibbs-sampled LDA. The second version is a model that uses a hierarchical Bayesian extension of LDA to directly account for distributed data. This model has a theoretical guarantee of convergence but is more complex to implement than the first algorithm. Our distributed algorithm for HDP takes the straightforward mapping approach, and merges newly created topics either by matching or by topic-id. Using five real-world text corpora we show that distributed learning works well in practice. For both LDA and HDP, we show that the converged test-data log probability for distributed learning is indistinguishable from that obtained with single-processor learning. Our extensive experimental results include learning topic models for two multi-million document collections using a 1024-processor parallel computer.
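The structure of the first LDA algorithm, in which each processor runs a collapsed Gibbs sweep over its local documents and the topic counts are then merged globally, can be illustrated with a short sketch. The paper itself provides no code; the Python below is a minimal, assumption-laden illustration, and every name in it (local_gibbs_sweep, global_update, the array layouts, the hyperparameters alpha and beta) is hypothetical rather than the authors' implementation.

# Sketch of the approximate distributed scheme described in the abstract:
# each processor Gibbs-samples its local documents, then word-topic counts
# are merged globally after every sweep. All identifiers are illustrative.
import numpy as np

def local_gibbs_sweep(docs, z, doc_topic, word_topic, topic_totals,
                      alpha, beta, vocab_size, rng):
    """One collapsed Gibbs sweep over this processor's local documents."""
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k_old = z[d][i]
            # Remove the current assignment from the count tables.
            doc_topic[d, k_old] -= 1
            word_topic[w, k_old] -= 1
            topic_totals[k_old] -= 1
            # Sample a new topic from the standard collapsed conditional:
            # p(k) proportional to (n_dk + alpha)(n_wk + beta)/(n_k + V*beta).
            p = (doc_topic[d] + alpha) * (word_topic[w] + beta) \
                / (topic_totals + vocab_size * beta)
            k_new = rng.choice(len(p), p=p / p.sum())
            z[d][i] = k_new
            doc_topic[d, k_new] += 1
            word_topic[w, k_new] += 1
            topic_totals[k_new] += 1

def global_update(local_word_topics, prev_global):
    """Merge per-processor word-topic counts into one global table by adding
    up the changes each processor made since the last synchronization."""
    deltas = [wt - prev_global for wt in local_word_topics]
    return prev_global + sum(deltas)

In a real deployment each processor would hold only its own shard of documents, and the merge would be a reduction (for example an MPI all-reduce) over the per-processor count deltas; the single-machine arrays above are assumptions made only to keep the sketch self-contained.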


Published in

The Journal of Machine Learning Research, Volume 10 (December 2009), 2936 pages.
ISSN: 1532-4435. EISSN: 1533-7928.
Publisher: JMLR.org.