ABSTRACT
Large-scale library digitization projects such as the Open Content Alliance are producing vast quantities of text, but little has been done to organize this data. Subject headings inherited from card catalogs are useful but limited, while full-text indexing is most appropriate for readers who already know exactly what they want. Statistical topic models provide a complementary function. These models can identify semantically coherent "topics" that are easily recognizable and meaningful to humans, but they have been too computationally intensive to run on library-scale corpora. This paper presents DCM-LDA, a topic model based on Dirichlet Compound Multinomial distributions. This model is both better able to represent observed properties of text and more scalable to extremely large text collections. We train an individual topic model for each book based on the co-occurrence of words within pages. We then cluster topics across books. The resulting topical clusters can be interpreted as subject facets, allowing readers to browse the topics of a collection quickly, find relevant books using topically expanded keyword searches, and explore topical relationships between books. We demonstrate this method by finding topics in a corpus of 1.49 billion words from 42,000 books in less than 20 hours, and it could easily scale well beyond this.
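The two-stage pipeline described above (one topic model per book, with pages as the units of co-occurrence, followed by clustering of topics across books) can be sketched roughly as follows. This is an illustration under loud assumptions, not the paper's implementation: it uses ordinary LDA with collapsed Gibbs sampling as a stand-in for DCM-LDA, a greedy cosine-similarity pass as a stand-in for the paper's clustering step, and toy data. The function names, hyperparameters, threshold, and book titles are all hypothetical.

```python
import math
import random
from collections import defaultdict


def fit_book_topics(pages, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit plain LDA to one book by collapsed Gibbs sampling (pages = documents).

    Stand-in for the per-book model; this is NOT DCM-LDA inference.
    Returns one {word: probability} dict per topic.
    """
    rng = random.Random(seed)
    vocab = sorted({w for page in pages for w in page})
    V = len(vocab)
    page_topic = [[0] * num_topics for _ in pages]               # n(page, topic)
    topic_word = [defaultdict(int) for _ in range(num_topics)]   # n(topic, word)
    topic_total = [0] * num_topics                               # n(topic)
    assign = []
    # Random initial topic assignment for every token.
    for d, page in enumerate(pages):
        zs = []
        for w in page:
            z = rng.randrange(num_topics)
            zs.append(z)
            page_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)
    for _ in range(iters):
        for d, page in enumerate(pages):
            for i, w in enumerate(page):
                # Remove the token's current assignment from the counts.
                z = assign[d][i]
                page_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample its topic from the collapsed conditional.
                weights = [
                    (page_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta) / (topic_total[k] + V * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = z
                page_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    # Smoothed topic-word distributions; each sums to 1 over the book's vocabulary.
    return [
        {w: (topic_word[k][w] + beta) / (topic_total[k] + V * beta) for w in vocab}
        for k in range(num_topics)
    ]


def cosine(p, q):
    """Cosine similarity between two sparse word distributions."""
    dot = sum(v * q.get(w, 0.0) for w, v in p.items())
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(
        sum(v * v for v in q.values())
    )
    return dot / norm if norm else 0.0


def cluster_topics(topics, threshold=0.5):
    """Greedy one-pass clustering (an assumption chosen for brevity).

    Each topic joins the first cluster whose seed topic is similar enough,
    otherwise it starts a new cluster. Returns lists of topic indices.
    """
    clusters = []
    for i, topic in enumerate(topics):
        for members in clusters:
            if cosine(topics[members[0]], topic) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Toy corpus: each book is a list of pages, each page a list of tokens.
books = {
    "civil_war_history": [
        ["army", "battle", "general", "army"],
        ["battle", "troops", "general"],
        ["river", "bridge", "troops", "river"],
    ],
    "botany_manual": [
        ["leaf", "stem", "flower", "leaf"],
        ["root", "soil", "stem"],
        ["flower", "seed", "soil"],
    ],
}

pooled, labels = [], []
for title, pages in books.items():
    for k, topic in enumerate(fit_book_topics(pages, num_topics=2, iters=50)):
        pooled.append(topic)
        labels.append((title, k))

# Each cluster of per-book topics plays the role of a candidate subject facet.
for members in cluster_topics(pooled, threshold=0.5):
    print([labels[i] for i in members])
```

Because each book is fitted independently, the first stage parallelizes trivially across books, which is the property the scalability claim rests on; only the much smaller set of topic distributions has to be brought together for the clustering stage.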
Index Terms
- Organizing the OCA: learning faceted subjects from a library of digital books