DOI: 10.1145/1255175.1255249
Article

Organizing the OCA: learning faceted subjects from a library of digital books

Published: 18 June 2007

ABSTRACT

Large-scale library digitization projects such as the Open Content Alliance are producing vast quantities of text, but little has been done to organize this data. Subject headings inherited from card catalogs are useful but limited, while full-text indexing is most appropriate for readers who already know exactly what they want. Statistical topic models provide a complementary function. These models can identify semantically coherent "topics" that are easily recognizable and meaningful to humans, but they have been too computationally intensive to run on library-scale corpora. This paper presents DCM-LDA, a topic model based on Dirichlet Compound Multinomial distributions. This model is both better able to represent observed properties of text and more scalable to extremely large text collections. We train an individual topic model for each book based on the co-occurrence of words within pages. We then cluster topics across books. The resulting topical clusters can be interpreted as subject facets, allowing readers to browse the topics of a collection quickly, find relevant books using topically expanded keyword searches, and explore topical relationships between books. We demonstrate this method by finding topics in a corpus of 1.49 billion words from 42,000 books in less than 20 hours, and it could easily scale well beyond this.
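As a rough illustration of the distribution the abstract names, the sketch below evaluates the standard Dirichlet compound multinomial (Pólya) log-probability of a word-count vector. It is not the paper's DCM-LDA implementation; the function name `dcm_log_likelihood` and the toy two-word vocabulary are assumptions made for this example only.

```python
from math import lgamma, exp


def dcm_log_likelihood(counts, alpha):
    """Log-probability of a word-count vector under a Dirichlet compound
    multinomial (Polya) distribution with parameter vector `alpha`.

    Hypothetical helper for illustration. The multinomial coefficient is
    included, so this scores the bag of words, not one particular ordering.
    """
    if len(counts) != len(alpha):
        raise ValueError("counts and alpha must have the same length")
    n = sum(counts)       # total number of word tokens
    a = sum(alpha)        # total concentration
    log_p = lgamma(n + 1) + lgamma(a) - lgamma(n + a)
    for x, al in zip(counts, alpha):
        log_p += lgamma(x + al) - lgamma(x + 1) - lgamma(al)
    return log_p


# Toy vocabulary of two words, two tokens observed.
# With small alpha, repeating one word is more probable than using both
# words once, which is the "burstiness" property DCM models capture.
bursty = dcm_log_likelihood([2, 0], [0.1, 0.1])
spread = dcm_log_likelihood([1, 1], [0.1, 0.1])
print(bursty > spread)
```

With all parameters equal to 1 the same function gives a uniform distribution over count vectors, so `exp(dcm_log_likelihood([1, 1], [1.0, 1.0]))` is 1/3 for the three possible two-token outcomes.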


Published in

JCDL '07: Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries
June 2007
534 pages
ISBN: 9781595936448
DOI: 10.1145/1255175

Copyright © 2007 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 415 of 1,482 submissions, 28%
