ABSTRACT
Probabilistic topic modeling provides a suite of tools for the unsupervised analysis of large collections of documents. Topic modeling algorithms can uncover the underlying themes of a collection and decompose its documents according to those themes. This analysis can be used for corpus exploration, document search, and a variety of prediction problems.
In this tutorial, I will review the state of the art in probabilistic topic models. I will describe the three components of topic modeling:
(1) Topic modeling assumptions
(2) Algorithms for computing with topic models
(3) Applications of topic models
In (1), I will describe latent Dirichlet allocation (LDA), which is one of the simplest topic models, and then describe a variety of ways that we can build on it. These include dynamic topic models, correlated topic models, supervised topic models, author-topic models, bursty topic models, Bayesian nonparametric topic models, and others. I will also discuss some of the fundamental statistical ideas that are used in building topic models, such as distributions on the simplex, hierarchical Bayesian modeling, and mixed-membership models.
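As a rough illustration of the assumptions in (1), the sketch below simulates the standard LDA generative process: each topic is a Dirichlet-distributed point on the vocabulary simplex, each document draws its own topic proportions, and each word is drawn from one topic chosen according to those proportions. The function name, the hyperparameter names (alpha, eta), and all default sizes are assumptions chosen for this example; they are not taken from the tutorial.

```python
import numpy as np

def generate_lda_corpus(n_docs=5, doc_len=20, n_topics=3, vocab_size=10,
                        alpha=0.5, eta=0.5, seed=0):
    """Simulate a tiny corpus from the LDA generative process (illustrative only)."""
    rng = np.random.default_rng(seed)

    # Each topic is a distribution over the vocabulary (a point on the simplex).
    topics = rng.dirichlet(np.full(vocab_size, eta), size=n_topics)

    proportions = []
    docs = []
    for _ in range(n_docs):
        # Mixed membership: every document has its own topic proportions.
        theta = rng.dirichlet(np.full(n_topics, alpha))
        # Draw one topic assignment per word position.
        z = rng.choice(n_topics, size=doc_len, p=theta)
        # Draw each word from the topic assigned to its position.
        words = np.array([rng.choice(vocab_size, p=topics[k]) for k in z])
        proportions.append(theta)
        docs.append(words)

    return topics, np.array(proportions), docs

if __name__ == "__main__":
    topics, proportions, docs = generate_lda_corpus()
    print("topic-word matrix shape:", topics.shape)
    print("first document's topic proportions:", proportions[0])
    print("first document's word ids:", docs[0])
```

Posterior inference, the subject of (2), runs this process in reverse: given only the word ids, it estimates the topics and the per-document proportions.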
In (2), I will review how we compute with topic models. I will describe approximate posterior inference for directed graphical models using both sampling and variational inference, and I will discuss the practical issues and pitfalls in developing these algorithms for topic models. Finally, I will describe some of our most recent work on building algorithms that can scale to millions of documents and to documents arriving in a stream.
In (3), I will discuss applications of topic models. These include applications to images, music, social networks, and other data in which we hope to uncover hidden patterns. I will describe some of our recent work on adapting topic modeling algorithms to collaborative filtering, legislative modeling, and bibliometrics without citations.
Finally, I will discuss some future directions and open research problems in topic models.