ABSTRACT
The rapid development of online social media sites has been accompanied by the generation of a tremendous amount of web content. Web users are shifting from data consumers to data producers. As a result, topic detection and tracking that does not take users' interests into account is no longer sufficient. This paper presents a statistical model that detects interpretable trends and topics from document streams, where each trend (short for trending story) corresponds to a series of continuing events or a storyline. A topic is represented by a cluster of words that frequently co-occur. A trend can contain multiple topics, and a topic can be shared by different trends. In addition, by leveraging a Recurrent Chinese Restaurant Process (RCRP), the number of trends in our model is determined automatically without human intervention, so that the model can better generalize to unseen data. Furthermore, the proposed model incorporates user interest to fully simulate the generation process of web content, which opens the opportunity for personalized recommendation in online social media. Experiments on three different datasets indicate that the proposed model can capture meaningful topics and trends, monitor the rise and fall of detected trends, outperform the baseline approach in terms of perplexity on held-out data, and improve user participation prediction by leveraging users' interests in different trends.
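To illustrate how an RCRP-style prior lets the number of trends grow without being fixed in advance, the following is a minimal sketch, not the paper's actual inference procedure. It assumes the commonly described RCRP form, in which an existing cluster's prior weight is its count in the current epoch plus exponentially decayed counts from a window of recent epochs, and a new cluster is opened with weight alpha. The function names, the parameters alpha, decay and window, and the NEW_CLUSTER label are all hypothetical choices made for this example.

```python
import math
import random
from collections import defaultdict

NEW_CLUSTER = "<new>"  # hypothetical label for "open a new trend"


def rcrp_prior_weights(history, current_counts, alpha=1.0, decay=1.0, window=3):
    """Unnormalized RCRP-style prior weights for the next document.

    history: list of {cluster_id: count} dicts, one per past epoch, oldest first.
    current_counts: {cluster_id: count} for the current epoch so far.
    alpha, decay, window: assumed hyperparameters (concentration,
    decay time scale, number of past epochs that still contribute).
    """
    weights = defaultdict(float)
    recent = history[-window:] if window > 0 else []
    # delta = 1 is the most recent past epoch, delta = 2 the one before, etc.
    for delta, past in enumerate(reversed(recent), start=1):
        for cluster, count in past.items():
            weights[cluster] += math.exp(-delta / decay) * count
    # Documents already assigned in the current epoch count at full weight.
    for cluster, count in current_counts.items():
        weights[cluster] += count
    result = dict(weights)
    # A brand-new cluster is always available with weight alpha.
    result[NEW_CLUSTER] = alpha
    return result


def sample_cluster(weights, rng):
    """Draw one cluster label with probability proportional to its weight."""
    labels = list(weights)
    return rng.choices(labels, weights=[weights[k] for k in labels], k=1)[0]


if __name__ == "__main__":
    history = [{"a": 4}, {"a": 2, "b": 1}]
    current = {"b": 3}
    w = rcrp_prior_weights(history, current, alpha=0.5, decay=1.0, window=2)
    print(w)
    print(sample_cluster(w, random.Random(0)))
```

Because the new-cluster option always keeps nonzero weight, the number of clusters is driven by the data rather than set by hand; in a full model these prior weights would be multiplied by a likelihood term for the document before sampling.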
Index Terms
- TUT: a statistical model for detecting trends, topics and user interests in social media