DOI: 10.1145/2939672.2939880
research-article

Topic Modeling of Short Texts: A Pseudo-Document View

Published: 13 August 2016

ABSTRACT

Recent years have witnessed the unprecedented growth of online social media, which has made short texts the prevalent format for information on the Internet. Because short texts are sparse, however, topic modeling of them remains a critical and closely watched challenge in both academia and industry. Considerable research effort has gone into building different types of probabilistic topic models for short texts, among which self-aggregation methods that use no auxiliary information have emerged as a way to provide informative cross-text word co-occurrences. However, models along this line are still rare, and the representative one, the Self-Aggregation Topic Model (SATM), is prone to overfitting and computationally expensive. In light of this, we propose a novel probabilistic model called the Pseudo-document-based Topic Model (PTM) for short text topic modeling. PTM introduces the concept of a pseudo document to implicitly aggregate short texts against data sparsity. By modeling the topic distributions of latent pseudo documents rather than of short texts, PTM is expected to perform well in both accuracy and efficiency. A Sparsity-enhanced PTM (SPTM for short) is also proposed by applying a Spike and Slab prior, with the purpose of eliminating undesired correlations between pseudo documents and latent topics. Extensive experiments on various real-world data sets against state-of-the-art baselines demonstrate the high quality of the topics learned by PTM and its robustness with reduced training samples. The experiments also show that i) SPTM gains a clear edge over PTM when the number of pseudo documents is relatively small, and ii) the constraint that a short text belongs to only one pseudo document is critically important for the success of PTM. Finally, an in-depth semantic analysis shows directly how pseudo documents help find cross-text word co-occurrences for topic modeling.
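The abstract does not spell out PTM's generative process, so the following is only a minimal sketch of one plausible reading of it: each short text is assigned to exactly one latent pseudo document, and topic distributions are attached to pseudo documents rather than to individual texts. The symmetric Dirichlet priors, the hyperparameter names (`alpha`, `beta`, `lam`), and every function name here are assumptions made for illustration, not the authors' specification.

```python
import random


def dirichlet(rng, concentration, size):
    """Draw from a symmetric Dirichlet by normalising independent Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def categorical(rng, probs):
    """Sample an index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(num_texts, text_len, num_pseudo, num_topics, vocab_size,
                    alpha=0.5, beta=0.5, lam=0.5, seed=0):
    """Generate synthetic short texts from a pseudo-document-style model.

    All priors and hyperparameters are illustrative assumptions.
    Returns a list of (pseudo_document_index, word_ids) pairs.
    """
    rng = random.Random(seed)
    # Distribution over pseudo documents (assumed Dirichlet prior).
    psi = dirichlet(rng, lam, num_pseudo)
    # One topic distribution per pseudo document, not per short text.
    theta = [dirichlet(rng, alpha, num_topics) for _ in range(num_pseudo)]
    # One word distribution per topic.
    phi = [dirichlet(rng, beta, vocab_size) for _ in range(num_topics)]

    corpus = []
    for _ in range(num_texts):
        # Each short text belongs to exactly one pseudo document.
        pseudo = categorical(rng, psi)
        words = []
        for _ in range(text_len):
            topic = categorical(rng, theta[pseudo])
            words.append(categorical(rng, phi[topic]))
        corpus.append((pseudo, words))
    return corpus


if __name__ == "__main__":
    for pseudo, words in generate_corpus(5, 6, num_pseudo=3,
                                         num_topics=4, vocab_size=20):
        print(pseudo, words)
```

Because the number of pseudo documents is much smaller than the number of short texts in this sketch, each `theta` row is shared by many texts, which is the sense in which aggregation would counter sparsity.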


      • Published in

        KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
        August 2016
        2176 pages
        ISBN:9781450342322
        DOI:10.1145/2939672

        Copyright © 2016 ACM


        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Acceptance Rates

        KDD '16 paper acceptance rate: 66 of 1,115 submissions (6%). Overall acceptance rate: 1,133 of 8,635 submissions (13%).
