DOI: 10.1145/1816123.1816156
research-article

Evaluating topic models for digital libraries

Published: 21 June 2010

Abstract

Topic models could have a huge impact on improving the ways users find and discover content in digital libraries and search interfaces, through their ability to automatically learn and apply subject tags to each and every item in a collection, and their ability to dynamically create virtual collections on the fly. However, much remains to be done to tap this potential, and to empirically evaluate the true value of a given topic model to humans. In this work, we sketch out some sub-tasks that we suggest pave the way towards this goal, and present methods for assessing the coherence and interpretability of topics learned by topic models. Our large-scale user study includes over 70 human subjects evaluating and scoring almost 500 topics learned from collections from a wide range of genres and domains. We show how a scoring model, based on pointwise mutual information of word pairs using Wikipedia, Google and MEDLINE as external data sources, performs well at predicting human scores. This automated scoring of topics is an important first step to integrating topic modeling into digital libraries.
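To make the idea of scoring a topic by the pointwise mutual information (PMI) of its word pairs more concrete, here is a minimal sketch. It is not the scoring model from the paper: the function name `pmi_topic_score`, the document-level co-occurrence counting, the averaging over all word pairs, and the `eps` smoothing constant are all assumptions made for illustration, and a tiny in-memory corpus stands in for an external source such as Wikipedia.

```python
import math
from itertools import combinations


def pmi_topic_score(topic_words, reference_docs, eps=1e-12):
    """Mean pairwise PMI of a topic's words (hypothetical helper).

    reference_docs is a list of token lists standing in for an external
    corpus. Two words "co-occur" if they appear in the same document,
    which is an assumption of this sketch.
    """
    doc_sets = [set(doc) for doc in reference_docs]
    n_docs = len(doc_sets)
    if n_docs == 0 or len(topic_words) < 2:
        raise ValueError("need at least one document and two topic words")

    def prob(*words):
        # Fraction of documents that contain every word in `words`.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(topic_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        # PMI(w1, w2) = log(p(w1, w2) / (p(w1) * p(w2))).
        # eps avoids log(0); pairs that never co-occur get a large
        # negative score rather than raising an error.
        scores.append(math.log((p12 + eps) / (p1 * p2 + eps)))
    return sum(scores) / len(scores)


# Toy reference corpus (invented for this example).
docs = [
    ["space", "nasa", "launch"],
    ["space", "nasa", "orbit"],
    ["bank", "loan", "rate"],
    ["bank", "rate", "space"],
]
coherent = pmi_topic_score(["space", "nasa"], docs)
incoherent = pmi_topic_score(["nasa", "loan"], docs)
print(coherent > incoherent)
```

Under these assumptions, a topic whose top words tend to appear together in the reference corpus gets a higher score than one whose words rarely or never co-occur, which is the property such a score would need in order to track human judgments of coherence.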

Information & Contributors

Information

Published In

JCDL '10: Proceedings of the 10th annual joint conference on Digital libraries
June 2010
424 pages
ISBN: 9781450300858
DOI: 10.1145/1816123

Sponsors

In-Cooperation

  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. evaluation
  2. topic models
  3. topic quality
  4. user studies

Qualifiers

  • Research-article

Conference

JCDL10: Joint Conference on Digital Libraries
June 21 - 25, 2010
Gold Coast, Queensland, Australia

Acceptance Rates

Overall Acceptance Rate 415 of 1,482 submissions, 28%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 67
  • Downloads (Last 6 weeks): 6
Reflects downloads up to 19 Feb 2025

Cited By

  • (2025) Study of technology communities and dominant technology lock-in in the Internet of Things domain - Based on social network analysis of patent network. Information Processing and Management: an International Journal 62:1. DOI: 10.1016/j.ipm.2024.103959. Online publication date: 1-Jan-2025
  • (2024) Data lake management using topic modeling techniques. Data and Metadata 3 (282). DOI: 10.56294/dm2024282. Online publication date: 15-Apr-2024
  • (2024) The Impact of Performance Reporting on Investment Behavior: Evidence from Disclosure Reform in the U.K. The Accounting Review 99:4 (427-453). DOI: 10.2308/TAR-2021-0863. Online publication date: 15-Jun-2024
  • (2024) Forecasting Macro-Economic Indicators Based on Text Information from Strategic Management Field in Russia. Vestnik of the Plekhanov Russian University of Economics (38-53). DOI: 10.21686/2413-2829-2024-3-38-53. Online publication date: 22-May-2024
  • (2024) Dynamic topic language model on heterogeneous children's mental health clinical notes. The Annals of Applied Statistics 18:4. DOI: 10.1214/24-AOAS1930. Online publication date: 1-Dec-2024
  • (2024) Tagging Items with Emerging Tags: A Neural Topic Model Based Few-Shot Learning Approach. ACM Transactions on Information Systems 42:4 (1-37). DOI: 10.1145/3641859. Online publication date: 23-Jan-2024
  • (2024) Navigating the storm: how managers' decisions shape companies in crisis. Review of Managerial Science. DOI: 10.1007/s11846-024-00801-w. Online publication date: 28-Aug-2024
  • (2024) A decadal study on identifying latent topics and research trends in open access LIS journals using topic modeling approach. Scientometrics 129:7 (3841-3869). DOI: 10.1007/s11192-024-05058-4. Online publication date: 1-Jul-2024
  • (2023) Web content topic modeling using LDA and HTML tags. PeerJ Computer Science 9 (e1459). DOI: 10.7717/peerj-cs.1459. Online publication date: 11-Jul-2023
  • (2023) Twitter's Mirroring of the 2022 Energy Crisis: What It Teaches Decision-Makers - A Preliminary Study. Romanian Journal of Information Science and Technology 2023:3-4 (312-322). DOI: 10.59277/ROMJIST.2023.3-4.05. Online publication date: 28-Sep-2023
