Article

Organizing the OCA: learning faceted subjects from a library of digital books

Authors:
David Mimno, University of Massachusetts: Amherst, Amherst, MA
Andrew McCallum, University of Massachusetts: Amherst, Amherst, MA

JCDL '07: Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries, June 2007, Pages 376–385
https://doi.org/10.1145/1255175.1255249

Published: 18 June 2007

ABSTRACT

Large-scale library digitization projects such as the Open Content Alliance are producing vast quantities of text, but little has been done to organize this data. Subject headings inherited from card catalogs are useful but limited, while full-text indexing is most appropriate for readers who already know exactly what they want. Statistical topic models provide a complementary function. These models can identify semantically coherent "topics" that are easily recognizable and meaningful to humans, but they have been too computationally intensive to run on library-scale corpora. This paper presents DCM-LDA, a topic model based on Dirichlet Compound Multinomial distributions. This model is simultaneously better able to represent observed properties of text and more scalable to extremely large text collections. We train individual topic models for each book based on the co-occurrence of words within pages. We then cluster topics across books. The resulting topical clusters can be interpreted as subject facets, allowing readers to browse the topics of a collection quickly, find relevant books using topically expanded keyword searches, and explore topical relationships between books. We demonstrate this method by finding topics on a corpus of 1.49 billion words from 42,000 books in less than 20 hours, and it could easily scale well beyond this.
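The abstract names the Dirichlet Compound Multinomial (DCM) distribution as the model's building block but does not give its equations. As a rough illustration only, the sketch below evaluates the standard DCM (Pólya) log-probability of a word-count vector. The function name `dcm_log_prob` and the toy count vectors are assumptions made for this example; this is not the authors' DCM-LDA training procedure.

```python
from math import lgamma, exp

def dcm_log_prob(counts, alpha):
    """Log-probability of a word-count vector under a Dirichlet compound
    multinomial (Polya) distribution with parameter vector alpha.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(counts) != len(alpha):
        raise ValueError("counts and alpha must have the same length")
    n = sum(counts)       # total number of word tokens
    a = sum(alpha)        # sum of the Dirichlet parameters
    # Multinomial coefficient: n! / prod_w(x_w!)
    log_p = lgamma(n + 1) - sum(lgamma(x + 1) for x in counts)
    # Normalising ratio: Gamma(A) / Gamma(n + A)
    log_p += lgamma(a) - lgamma(n + a)
    # Per-word ratios: Gamma(x_w + alpha_w) / Gamma(alpha_w)
    log_p += sum(lgamma(x + al) - lgamma(al) for x, al in zip(counts, alpha))
    return log_p

# Toy two-word vocabulary, four tokens per "page" (made-up data).
concentrated = [4, 0]   # one word repeated four times
balanced = [2, 2]       # both words used equally
small_alpha = [0.1, 0.1]

print(dcm_log_prob(concentrated, small_alpha))
print(dcm_log_prob(balanced, small_alpha))
```

With small parameter values, the DCM assigns more probability to the count vector concentrated on one word than to the balanced one, whereas a plain multinomial with equal word probabilities would prefer the balanced vector. Working in log space with `lgamma` avoids overflow for long pages.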

Index Terms

Organizing the OCA: learning faceted subjects from a library of digital books
1. Applied computing
  1. Computers in other domains
    1. Digital libraries and archives
2. Information systems
  1. Information systems applications
    1. Digital libraries and archives

Published in
JCDL '07: Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
June 2007
534 pages
ISBN: 9781595936448
DOI: 10.1145/1255175
General Chair: Edie Rasmussen (University of British Columbia, Canada)
Program Chairs: Ray R. Larson (University of California, Berkeley), Elaine Toms (Dalhousie University, Canada), Shigeo Sugimoto (University of Tsukuba, Japan)
Copyright © 2007 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
classification
topic models
Acceptance Rates
Overall Acceptance Rate: 415 of 1,482 submissions, 28%