ABSTRACT
Recent years have witnessed unprecedented growth of online social media, which has made short texts the prevalent format for information on the Internet. Because short texts are inherently sparse, however, topic modeling for them remains a critical and closely watched challenge in both academia and industry. Considerable research effort has gone into building different types of probabilistic topic models for short texts, among which self-aggregation methods that use no auxiliary information have emerged as a way to provide informative cross-text word co-occurrences. However, models along this line are still rare, and the representative one, the Self-Aggregation Topic Model (SATM), is prone to overfitting and computationally expensive. In light of this, we propose a novel probabilistic model called the Pseudo-document-based Topic Model (PTM) for short text topic modeling. PTM introduces the concept of a pseudo document to implicitly aggregate short texts against data sparsity. By modeling the topic distributions of latent pseudo documents rather than of short texts, PTM is expected to perform well in both accuracy and efficiency. A Sparsity-enhanced PTM (SPTM for short) is also proposed by applying a Spike and Slab prior, with the purpose of eliminating undesired correlations between pseudo documents and latent topics. Extensive experiments on various real-world data sets with state-of-the-art baselines demonstrate the high quality of the topics learned by PTM and its robustness with reduced training samples. We also show that i) SPTM gains a clear edge over PTM when the number of pseudo documents is relatively small, and ii) the constraint that a short text belongs to only one pseudo document is critically important for the success of PTM. Finally, we conduct an in-depth semantic analysis that directly reveals the role of pseudo documents in finding cross-text word co-occurrences for topic modeling.
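The pseudo-document idea described in the abstract can be illustrated with a small generative sketch. It is only an illustration under stated assumptions, not the authors' implementation: the abstract fixes only that each short text belongs to exactly one latent pseudo document and that topic distributions are attached to pseudo documents rather than to short texts. The symmetric Dirichlet priors, the hyperparameter names (`alpha`, `beta`, `lam`), the fixed text length, and the function names are all hypothetical choices made for the example.

```python
import random


def sample_dirichlet(rng, concentration, size):
    """Draw from a symmetric Dirichlet by normalising independent Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(num_texts, num_pseudo_docs, num_topics, vocab_size,
                    text_length=5, alpha=0.5, beta=0.5, lam=1.0, seed=0):
    """Generate short texts through latent pseudo documents.

    All priors and hyperparameters here are assumptions for illustration.
    Returns a list of (pseudo_doc_index, word_ids) pairs.
    """
    rng = random.Random(seed)
    # Assumed prior over which pseudo document a short text is assigned to.
    psi = sample_dirichlet(rng, lam, num_pseudo_docs)
    # One topic distribution per pseudo document, not per short text.
    theta = [sample_dirichlet(rng, alpha, num_topics)
             for _ in range(num_pseudo_docs)]
    # One word distribution per topic.
    phi = [sample_dirichlet(rng, beta, vocab_size) for _ in range(num_topics)]

    corpus = []
    for _ in range(num_texts):
        # Each short text belongs to exactly one pseudo document.
        pseudo_doc = rng.choices(range(num_pseudo_docs), weights=psi)[0]
        words = []
        for _ in range(text_length):
            # Topics come from the pseudo document's topic distribution.
            topic = rng.choices(range(num_topics), weights=theta[pseudo_doc])[0]
            words.append(rng.choices(range(vocab_size), weights=phi[topic])[0])
        corpus.append((pseudo_doc, words))
    return corpus


if __name__ == "__main__":
    for pseudo_doc, words in generate_corpus(5, 3, 4, 20):
        print(pseudo_doc, words)
```

Because many short texts share one pseudo document, their words are pooled when that pseudo document's topic distribution is estimated, which is how the aggregation counters sparsity in this sketch. Inference, and the Spike and Slab prior used by SPTM, are outside its scope.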
Index Terms
- Topic Modeling of Short Texts: A Pseudo-Document View