Abstract
We describe distributed algorithms for two widely used topic models, namely the Latent Dirichlet Allocation (LDA) model and the Hierarchical Dirichlet Process (HDP) model. In our distributed algorithms the data is partitioned across separate processors and inference is done in a parallel, distributed fashion. We propose two distributed algorithms for LDA. The first algorithm is a straightforward mapping of LDA to a distributed processor setting: processors concurrently perform Gibbs sampling over their local data, followed by a global update of topic counts. The algorithm is simple to implement and can be viewed as an approximation to Gibbs-sampled LDA. The second algorithm uses a hierarchical Bayesian extension of LDA to directly account for the distributed data. This model has a theoretical guarantee of convergence but is more complex to implement than the first algorithm. Our distributed algorithm for HDP takes the straightforward mapping approach and merges newly created topics either by matching or by topic-id. Using five real-world text corpora we show that distributed learning works well in practice. For both LDA and HDP, we show that the test-data log probability at convergence under distributed learning is indistinguishable from that obtained with single-processor learning. Our extensive experimental results include learning topic models for two multi-million-document collections using a 1024-processor parallel computer.
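The synchronization step of the first algorithm lends itself to a compact illustration. Below is a minimal sketch, assuming NumPy arrays for the word-topic count matrices; the function name and toy data are illustrative, not from the paper. Each processor starts a Gibbs sweep from a copy of the shared counts, and the global update folds every processor's net change back into the shared matrix.

```python
import numpy as np

def adlda_global_update(n_old, local_counts):
    """Merge per-processor word-topic counts after a local Gibbs sweep.

    Each processor begins its sweep from a copy of the shared counts
    n_old and resamples only its own documents.  The global update adds
    every processor's net change back into the shared matrix:

        N_new = N_old + sum_p (N_p - N_old)
    """
    return n_old + sum(n_p - n_old for n_p in local_counts)

# Toy illustration: 3 words, 2 topics, 2 processors.
n_old = np.array([[4, 1], [2, 2], [0, 3]])
n_p1 = n_old + np.array([[1, 0], [0, -1], [0, 0]])  # processor 1's resampling
n_p2 = n_old + np.array([[0, 1], [1, 0], [0, -1]])  # processor 2's resampling
n_new = adlda_global_update(n_old, [n_p1, n_p2])
```

Because each processor samples against a stale copy of the counts between synchronizations, the combined chain is only an approximation to exact collapsed Gibbs sampling, which is why the first algorithm carries no convergence guarantee.

For the HDP algorithm, the abstract names two ways of merging newly created topics across processors; the sketch below illustrates the matching variant, cast as a linear assignment problem. The choice of symmetric KL divergence as the cost and the use of SciPy's Hungarian solver are assumptions made for illustration, not the paper's stated procedure.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_new_topics(ref_topics, new_topics):
    """Align one processor's new topics to a reference set.

    Both arguments are (num_topics, vocab_size) arrays whose rows are
    smoothed topic-word probability vectors.  Returns, for each new
    topic, the index of the reference topic it is merged with.
    """
    def sym_kl(p, q):  # symmetric KL divergence between two distributions
        return np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p))

    cost = np.array([[sym_kl(n, r) for r in ref_topics] for n in new_topics])
    _, col_ind = linear_sum_assignment(cost)  # Hungarian algorithm
    return col_ind

# Recover a known permutation of slightly perturbed topics.
rng = np.random.default_rng(0)
ref = rng.dirichlet(np.ones(50), size=4)    # 4 reference topics, vocab of 50
new = ref[[2, 0, 3, 1]] * 0.9 + 0.1 / 50    # permuted, smoothed copies
print(match_new_topics(ref, new))           # -> [2 0 3 1]
```

The alternative, merging by topic-id, simply adds counts for topics with the same index on every processor; this is cheaper but assumes that corresponding indices denote the same topic.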