Abstract
We describe distributed algorithms for two widely used topic models, namely the Latent Dirichlet Allocation (LDA) model and the Hierarchical Dirichlet Process (HDP) model. In our distributed algorithms the data is partitioned across separate processors and inference is done in a parallel, distributed fashion. We propose two distributed algorithms for LDA. The first algorithm is a straightforward mapping of LDA to a distributed processor setting: processors concurrently perform Gibbs sampling over their local data, followed by a global update of topic counts. The algorithm is simple to implement and can be viewed as an approximation to Gibbs-sampled LDA. The second algorithm uses a hierarchical Bayesian extension of LDA to directly account for the distributed data. This model has a theoretical guarantee of convergence but is more complex to implement than the first algorithm. Our distributed algorithm for HDP takes the straightforward mapping approach and merges newly created topics either by matching or by topic-id. Using five real-world text corpora we show that distributed learning works well in practice. For both LDA and HDP, we show that the test-data log probability at convergence under distributed learning is indistinguishable from that obtained with single-processor learning. Our extensive experimental results include learning topic models for two multi-million-document collections using a 1024-processor parallel computer.
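The synchronization step of the first algorithm lends itself to a compact illustration. Below is a minimal sketch, assuming NumPy arrays for the word-topic count matrices; the function name and toy data are illustrative, not from the paper. Each processor starts a Gibbs sweep from a copy of the shared counts, and the global update folds every processor's net change back into the shared matrix.

```python
import numpy as np

def adlda_global_update(n_old, local_counts):
    """Merge per-processor word-topic counts after a local Gibbs sweep.

    Each processor begins its sweep from a copy of the shared counts
    n_old and resamples only its own documents.  The global update adds
    every processor's net change back into the shared matrix:

        N_new = N_old + sum_p (N_p - N_old)
    """
    return n_old + sum(n_p - n_old for n_p in local_counts)

# Toy illustration: 3 words, 2 topics, 2 processors.
n_old = np.array([[4, 1], [2, 2], [0, 3]])
n_p1 = n_old + np.array([[1, 0], [0, -1], [0, 0]])  # processor 1's resampling
n_p2 = n_old + np.array([[0, 1], [1, 0], [0, -1]])  # processor 2's resampling
n_new = adlda_global_update(n_old, [n_p1, n_p2])
```

Because each processor samples against a stale copy of the counts between synchronizations, the combined chain is only an approximation to exact collapsed Gibbs sampling, which is why the first algorithm carries no convergence guarantee.

For the HDP algorithm, the abstract names two ways of merging newly created topics across processors; the sketch below illustrates the matching variant, cast as a linear assignment problem. The choice of symmetric KL divergence as the cost and the use of SciPy's Hungarian solver are assumptions made for illustration, not the paper's stated procedure.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_new_topics(ref_topics, new_topics):
    """Align one processor's new topics to a reference set.

    Both arguments are (num_topics, vocab_size) arrays whose rows are
    smoothed topic-word probability vectors.  Returns, for each new
    topic, the index of the reference topic it is merged with.
    """
    def sym_kl(p, q):  # symmetric KL divergence between two distributions
        return np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p))

    cost = np.array([[sym_kl(n, r) for r in ref_topics] for n in new_topics])
    _, col_ind = linear_sum_assignment(cost)  # Hungarian algorithm
    return col_ind

# Recover a known permutation of slightly perturbed topics.
rng = np.random.default_rng(0)
ref = rng.dirichlet(np.ones(50), size=4)    # 4 reference topics, vocab of 50
new = ref[[2, 0, 3, 1]] * 0.9 + 0.1 / 50    # permuted, smoothed copies
print(match_new_topics(ref, new))           # -> [2 0 3 1]
```

The alternative, merging by topic-id, simply adds counts for topics with the same index on every processor; this is cheaper but assumes that corresponding indices denote the same topic.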