skip to main content
10.1145/3219819.3220094acmotherconferencesArticle/Chapter ViewAbstractPublication PageskddConference Proceedingsconference-collections
research-article

Model-based Clustering of Short Text Streams

Published:19 July 2018Publication History

ABSTRACT

Short text stream clustering has become an increasingly important problem due to the explosive growth of short text in diverse social medias. In this paper, we propose a model-based short text stream clustering algorithm (MStream) which can deal with the concept drift problem and sparsity problem naturally. The MStream algorithm can achieve state-of-the-art performance with only one pass of the stream, and can have even better performance when we allow multiple iterations of each batch. We further propose an improved algorithm of MStream with forgetting rules called MStreamF, which can efficiently delete outdated documents by deleting clusters of outdated batches. Our extensive experimental study shows that MStream and MStreamF can achieve better performance than three baselines on several real datasets.

Skip Supplemental Material Section

Supplemental Material

chao_model-based_clustering.mp4

mp4

365.9 MB

References

  1. Charu C Aggarwal . 2013. A Survey of Stream Clustering Algorithms. (2013).Google ScholarGoogle Scholar
  2. Charu C Aggarwal and S Yu Philip . 2010. On clustering massive text and categorical data streams. Knowledge and information systems Vol. 24, 2 (2010), 171--196.Google ScholarGoogle Scholar
  3. Charu C Aggarwal, S Yu Philip, Jiawei Han, and Jianyong Wang . 2003. -A Framework for Clustering Evolving Data Streams. In VLDB. Elsevier, 81--92. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Amr Ahmed and Eric Xing . 2008. Dynamic non-parametric mixture models and the recurrent chinese restaurant process: with applications to evolutionary clustering SDM. SIAM, 219--230.Google ScholarGoogle Scholar
  5. Hesam Amoualian, Marianne Clausel, Eric Gaussier, and Massih-Reza Amini . 2016. Streaming-lda: A copula-based approach to modeling topic dependencies in document streams. In SIGKDD. ACM, 695--704. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Charles E Antoniak . 1974. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The annals of statistics (1974), 1152--1174.Google ScholarGoogle Scholar
  7. David Blackwell and James B MacQueen . 1973. Ferguson distributions via Pólya urn schemes. The annals of statistics (1973), 353--355.Google ScholarGoogle Scholar
  8. David M Blei and John D Lafferty . 2006. Dynamic topic models. In ICML. ACM, 113--120. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. David M. Blei, Andrew Y. Ng, and Michael I. Jordan . 2003. Latent Dirichlet Allocation. J. Mach. Learn. Res. (2003). deftempurl%http://dl.acm.org/citation.cfm?id=944919.944937 tempurl Google ScholarGoogle Scholar
  10. Feng Cao, Martin Estert, Weining Qian, and Aoying Zhou . 2006. Density-based clustering over an evolving data stream with noise SDM. SIAM, 328--339.Google ScholarGoogle Scholar
  11. Arnaud Doucet, Nando De Freitas, and Neil Gordon . 2001. An introduction to sequential Monte Carlo methods. In Sequential Monte Carlo methods in practice. Springer, 3--14.Google ScholarGoogle Scholar
  12. Nan Du, Mehrdad Farajtabar, Amr Ahmed, Alexander J Smola, and Le Song . 2015. Dirichlet-hawkes processes with applications to clustering continuous-time document streams. In SIGKDD. ACM, 219--228. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Thomas S Ferguson . 1973. A Bayesian analysis of some nonparametric problems. The annals of statistics (1973), 209--230.Google ScholarGoogle Scholar
  14. Hemant Ishwaran and Lancelot F James . 2001. Gibbs sampling methods for stick-breaking priors. J. Amer. Statist. Assoc. Vol. 96, 453 (2001), 161--173.Google ScholarGoogle ScholarCross RefCross Ref
  15. Tomoharu Iwata, Shinji Watanabe, Takeshi Yamada, and Naonori Ueda . 2009. Topic Tracking Model for Analyzing Consumer Purchase Behavior. IJCAI, Vol. Vol. 9. 1427--1432. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Anil K. Jain . 2010. Data clustering: 50 years beyond K-means. Pattern Recognition Letters Vol. 31, 8 (2010), 651--666. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Argyris Kalogeratos, Panagiotis Zagorisios, and Aristidis Likas . 2016. Improving text stream clustering using term burstiness and co-burstiness SETN. ACM, 16.Google ScholarGoogle Scholar
  18. Shangsong Liang, Emine Yilmaz, and Evangelos Kanoulas . 2016. Dynamic clustering of streaming short documents. In SIGKDD. ACM, 995--1004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Alireza Rezaei Mahdiraji . 2009. Clustering data stream: A survey of algorithms. International Journal of Knowledge-based and Intelligent Engineering Systems Vol. 13, 2 (2009), 39--44. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Hai-Long Nguyen, Yew-Kwong Woon, and Wee-Keong Ng . 2015. A survey on data stream clustering and classification. Knowledge and information systems Vol. 45, 3 (2015), 535--569. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Kamal Nigam, Andrew McCallum, Sebastian Thrun, and Tom M. Mitchell . 2000. Text Classification from Labeled and Unlabeled Documents using EM. Machine Learning Vol. 39, 2/3 (2000), 103--134. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Gerard Salton, A. Wong, and C. S. Yang . 1975. A Vector Space Model for Automatic Indexing. Commun. ACM Vol. 18, 11 (1975), 613--620. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Lidan Shou, Zhenhua Wang, Ke Chen, and Gang Chen . 2013. Sumblr: continuous summarization of evolving tweet streams SIGIR. ACM, 533--542. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Jonathan A Silva, Elaine R Faria, Rodrigo C Barros, Eduardo R Hruschka, Andre CPLF De Carvalho, and Jo ao Gama . 2013. Data stream clustering: A survey. ACM Computing Surveys (CSUR) Vol. 46, 1 (2013), 13. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Alexander Strehl and Joydeep Ghosh . 2003. Cluster ensembles--a knowledge reuse framework for combining multiple partitions. The Journal of Machine Learning Research Vol. 3 (2003), 583--617. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Yee Whye Teh . 2010. Dirichlet process. In Encyclopedia of machine learning. Springer, 280--287.Google ScholarGoogle Scholar
  27. Xuerui Wang and Andrew McCallum . 2006. Topics over time: a non-Markov continuous-time model of topical trends SIGKDD. ACM, 424--433. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Yu Wang, Eugene Agichtein, and Michele Benzi . 2012. TM-LDA: efficient online modeling of latent topic transitions in social media SIGKDD. ACM, 123--131. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Xing Wei, Jimeng Sun, and Xuerui Wang . 2007. Dynamic Mixture Models for Multiple Time-Series.. In IJCAI, Vol. Vol. 7. 2909--2914. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Jianhua Yin and Jianyong Wang . 2014. A dirichlet multinomial mixture model-based approach for short text clustering SIGKDD. ACM, 233--242. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Jianhua Yin and Jianyong Wang . 2016. A model-based approach for text clustering with outlier detection ICDE. IEEE, 625--636.Google ScholarGoogle Scholar
  32. Shinjae Yoo, Hao Huang, and Shiva Prasad Kasiviswanathan . 2016. Streaming spectral clustering. In ICDE. IEEE, 637--648.Google ScholarGoogle Scholar
  33. Shi Zhong . 2005. Efficient streaming text clustering. Neural Networks Vol. 18, 5--6 (2005), 790--798. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Model-based Clustering of Short Text Streams

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Other conferences
        KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
        July 2018
        2925 pages
        ISBN:9781450355520
        DOI:10.1145/3219819

        Copyright © 2018 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 19 July 2018

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article

        Acceptance Rates

        KDD '18 Paper Acceptance Rate107of983submissions,11%Overall Acceptance Rate1,133of8,635submissions,13%

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader