ABSTRACT
Short text stream clustering has become an increasingly important problem due to the explosive growth of short text in diverse social medias. In this paper, we propose a model-based short text stream clustering algorithm (MStream) which can deal with the concept drift problem and sparsity problem naturally. The MStream algorithm can achieve state-of-the-art performance with only one pass of the stream, and can have even better performance when we allow multiple iterations of each batch. We further propose an improved algorithm of MStream with forgetting rules called MStreamF, which can efficiently delete outdated documents by deleting clusters of outdated batches. Our extensive experimental study shows that MStream and MStreamF can achieve better performance than three baselines on several real datasets.
Supplemental Material
- Charu C Aggarwal . 2013. A Survey of Stream Clustering Algorithms. (2013).Google Scholar
- Charu C Aggarwal and S Yu Philip . 2010. On clustering massive text and categorical data streams. Knowledge and information systems Vol. 24, 2 (2010), 171--196.Google Scholar
- Charu C Aggarwal, S Yu Philip, Jiawei Han, and Jianyong Wang . 2003. -A Framework for Clustering Evolving Data Streams. In VLDB. Elsevier, 81--92. Google ScholarDigital Library
- Amr Ahmed and Eric Xing . 2008. Dynamic non-parametric mixture models and the recurrent chinese restaurant process: with applications to evolutionary clustering SDM. SIAM, 219--230.Google Scholar
- Hesam Amoualian, Marianne Clausel, Eric Gaussier, and Massih-Reza Amini . 2016. Streaming-lda: A copula-based approach to modeling topic dependencies in document streams. In SIGKDD. ACM, 695--704. Google ScholarDigital Library
- Charles E Antoniak . 1974. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The annals of statistics (1974), 1152--1174.Google Scholar
- David Blackwell and James B MacQueen . 1973. Ferguson distributions via Pólya urn schemes. The annals of statistics (1973), 353--355.Google Scholar
- David M Blei and John D Lafferty . 2006. Dynamic topic models. In ICML. ACM, 113--120. Google ScholarDigital Library
- David M. Blei, Andrew Y. Ng, and Michael I. Jordan . 2003. Latent Dirichlet Allocation. J. Mach. Learn. Res. (2003). deftempurl%http://dl.acm.org/citation.cfm?id=944919.944937 tempurl Google Scholar
- Feng Cao, Martin Estert, Weining Qian, and Aoying Zhou . 2006. Density-based clustering over an evolving data stream with noise SDM. SIAM, 328--339.Google Scholar
- Arnaud Doucet, Nando De Freitas, and Neil Gordon . 2001. An introduction to sequential Monte Carlo methods. In Sequential Monte Carlo methods in practice. Springer, 3--14.Google Scholar
- Nan Du, Mehrdad Farajtabar, Amr Ahmed, Alexander J Smola, and Le Song . 2015. Dirichlet-hawkes processes with applications to clustering continuous-time document streams. In SIGKDD. ACM, 219--228. Google ScholarDigital Library
- Thomas S Ferguson . 1973. A Bayesian analysis of some nonparametric problems. The annals of statistics (1973), 209--230.Google Scholar
- Hemant Ishwaran and Lancelot F James . 2001. Gibbs sampling methods for stick-breaking priors. J. Amer. Statist. Assoc. Vol. 96, 453 (2001), 161--173.Google ScholarCross Ref
- Tomoharu Iwata, Shinji Watanabe, Takeshi Yamada, and Naonori Ueda . 2009. Topic Tracking Model for Analyzing Consumer Purchase Behavior. IJCAI, Vol. Vol. 9. 1427--1432. Google ScholarDigital Library
- Anil K. Jain . 2010. Data clustering: 50 years beyond K-means. Pattern Recognition Letters Vol. 31, 8 (2010), 651--666. Google ScholarDigital Library
- Argyris Kalogeratos, Panagiotis Zagorisios, and Aristidis Likas . 2016. Improving text stream clustering using term burstiness and co-burstiness SETN. ACM, 16.Google Scholar
- Shangsong Liang, Emine Yilmaz, and Evangelos Kanoulas . 2016. Dynamic clustering of streaming short documents. In SIGKDD. ACM, 995--1004. Google ScholarDigital Library
- Alireza Rezaei Mahdiraji . 2009. Clustering data stream: A survey of algorithms. International Journal of Knowledge-based and Intelligent Engineering Systems Vol. 13, 2 (2009), 39--44. Google ScholarDigital Library
- Hai-Long Nguyen, Yew-Kwong Woon, and Wee-Keong Ng . 2015. A survey on data stream clustering and classification. Knowledge and information systems Vol. 45, 3 (2015), 535--569. Google ScholarDigital Library
- Kamal Nigam, Andrew McCallum, Sebastian Thrun, and Tom M. Mitchell . 2000. Text Classification from Labeled and Unlabeled Documents using EM. Machine Learning Vol. 39, 2/3 (2000), 103--134. Google ScholarDigital Library
- Gerard Salton, A. Wong, and C. S. Yang . 1975. A Vector Space Model for Automatic Indexing. Commun. ACM Vol. 18, 11 (1975), 613--620. Google ScholarDigital Library
- Lidan Shou, Zhenhua Wang, Ke Chen, and Gang Chen . 2013. Sumblr: continuous summarization of evolving tweet streams SIGIR. ACM, 533--542. Google ScholarDigital Library
- Jonathan A Silva, Elaine R Faria, Rodrigo C Barros, Eduardo R Hruschka, Andre CPLF De Carvalho, and Jo ao Gama . 2013. Data stream clustering: A survey. ACM Computing Surveys (CSUR) Vol. 46, 1 (2013), 13. Google ScholarDigital Library
- Alexander Strehl and Joydeep Ghosh . 2003. Cluster ensembles--a knowledge reuse framework for combining multiple partitions. The Journal of Machine Learning Research Vol. 3 (2003), 583--617. Google ScholarDigital Library
- Yee Whye Teh . 2010. Dirichlet process. In Encyclopedia of machine learning. Springer, 280--287.Google Scholar
- Xuerui Wang and Andrew McCallum . 2006. Topics over time: a non-Markov continuous-time model of topical trends SIGKDD. ACM, 424--433. Google ScholarDigital Library
- Yu Wang, Eugene Agichtein, and Michele Benzi . 2012. TM-LDA: efficient online modeling of latent topic transitions in social media SIGKDD. ACM, 123--131. Google ScholarDigital Library
- Xing Wei, Jimeng Sun, and Xuerui Wang . 2007. Dynamic Mixture Models for Multiple Time-Series.. In IJCAI, Vol. Vol. 7. 2909--2914. Google ScholarDigital Library
- Jianhua Yin and Jianyong Wang . 2014. A dirichlet multinomial mixture model-based approach for short text clustering SIGKDD. ACM, 233--242. Google ScholarDigital Library
- Jianhua Yin and Jianyong Wang . 2016. A model-based approach for text clustering with outlier detection ICDE. IEEE, 625--636.Google Scholar
- Shinjae Yoo, Hao Huang, and Shiva Prasad Kasiviswanathan . 2016. Streaming spectral clustering. In ICDE. IEEE, 637--648.Google Scholar
- Shi Zhong . 2005. Efficient streaming text clustering. Neural Networks Vol. 18, 5--6 (2005), 790--798. Google ScholarDigital Library
Index Terms
- Model-based Clustering of Short Text Streams
Recommendations
Efficient clustering of short text streams using online-offline clustering
DocEng '21: Proceedings of the 21st ACM Symposium on Document EngineeringShort text stream clustering is an important but challenging task since massive amount of text is generated from different sources such as micro-blogging, question-answering, and social news aggregation websites. The two major challenges of clustering ...
A dirichlet multinomial mixture model-based approach for short text clustering
KDD '14: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data miningShort text clustering has become an increasingly important task with the popularity of social media like Twitter, Google+, and Facebook. It is a challenging problem due to its sparse, high-dimensional, and large-volume characteristics. In this paper, we ...
Short Text Stream Clustering via Frequent Word Pairs and Reassignment of Outliers to Clusters
DocEng '20: Proceedings of the ACM Symposium on Document Engineering 2020Short text stream clustering is an important but challenging task since massive amounts of text are generated from different social media. Given streams of texts, the proposed method clusters the streams of texts based on the frequently occurring word ...
Comments