ABSTRACT
Dynamic clustering---how to efficiently maintain data clusters along with updates in the underlying dataset---is a difficult topic. This is especially true for density-based clustering, where objects are aggregated based on transitivity of proximity, under which deciding the cluster(s) of an object may require the inspection of numerous other objects. The phenomenon is unfortunate, given the popular usage of this clustering approach in many applications demanding data updates.
Motivated by the above, we investigate the algorithmic principles for dynamic clustering by DBSCAN, a successful representative of density-based clustering, and ρ-approximate DBSCAN, proposed to bring down the computational hardness of the former on static data. Surprisingly, we prove that the ρ-approximate version suffers from the very same hardness when the dataset is fully dynamic, namely, when both insertions and deletions are allowed. We also show that this issue goes away as soon as tiny further relaxation is applied, yet still ensuring the same quality---known as the ``sandwich guarantee''---of ρ-approximate DBSCAN. Our algorithms guarantee near-constant update processing, and outperform existing approaches by a factor over two orders of magnitude.
- M. Ankerst, M. M. Breunig, H. Kriegel, and J. Sander. OPTICS: ordering points to identify the clustering structure. In SIGMOD, pages 49--60, 1999. Google ScholarDigital Library
- S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu. An optimal algorithm for approximate nearest neighbor searching fixed dimensions. JACM, 45(6):891--923, 1998. Google ScholarDigital Library
- N. Beckmann, H. Kriegel, R. Schneider, and B. Seeger. The R*-tree: An efficient and robust access method for points and rectangles. In SIGMOD, pages 322--331, 1990. Google ScholarDigital Library
- F. Cao, M. Ester, W. Qian, and A. Zhou. Density-based clustering over an evolving data stream with noise. In ICDM, pages 328--339, 2006.Google ScholarCross Ref
- T. M. Chan. A dynamic data structure for 3-d convex hulls and 2-d nearest neighbor queries. JACM, 57(3), 2010. Google ScholarDigital Library
- J. Erickson. On the relative complexities of some geometric problems. In CCCG, pages 85--90, 1995.Google Scholar
- J. Erickson. New lower bounds for Hopcroft's problem. Disc. & Comp. Geo., 16(4):389--418, 1996.Google ScholarDigital Library
- M. Ester, H. Kriegel, J. Sander, M. Wimmer, and X. Xu. Incremental clustering for mining in a data warehousing environment. In VLDB, pages 323--333, 1998. Google ScholarDigital Library
- M. Ester, H. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In SIGKDD, pages 226--231, 1996. Google ScholarDigital Library
- J. Gan and Y. Tao. DBSCAN revisited: Mis-claim, un-fixability, and approximation. In SIGMOD, pages 519--530, 2015. Google ScholarDigital Library
- A. Gunawan. A faster algorithm for DBSCAN. Master's thesis, Technische University Eindhoven, March 2013.Google Scholar
- A. Guttman. R-trees: a dynamic index structure for spatial searching. In SIGMOD, pages 47--57, 1984. Google ScholarDigital Library
- J. Han, M. Kamber, and J. Pei. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2012. Google ScholarDigital Library
- J. Holm, K. de Lichtenberg, and M. Thorup. Poly-logarithmic deterministic fully-dynamic algorithms for connectivity, minimum spanning tree, 2-edge, and biconnectivity. JACM, 48(4):723--760, 2001. Google ScholarDigital Library
- S. Lühr and M. Lazarescu. Incremental clustering of dynamic data streams using connectivity based representative points. 68(1):1--27, 2009. Google ScholarDigital Library
- D. M. Mount and E. Park. A dynamic data structure for approximate range searching. In SoCG, pages 247--256, 2010. Google ScholarDigital Library
- S. Nassar, J. Sander, and C. Cheng. Incremental and effective data summarization for dynamic hierarchical clustering. In SIGMOD, pages 467--478, 2004. Google ScholarDigital Library
- S. Nittel, K. T. Leung, and A. Braverman. Scaling clustering algorithms for massive data sets using data streams. In ICDE, page 830, 2004. Google ScholarDigital Library
- I. Ntoutsi, A. Zimek, T. Palpanas, P. Kröger, and H. Kriegel. Density-based projected clustering over high dimensional data streams. In ICDM, pages 987--998, 2012.Google ScholarCross Ref
- R. G. Pensa, D. Ienco, and R. Meo. Hierarchical co-clustering: off-line and incremental approaches. Data Min. Knowl. Discov., 28(1):31--64, 2014. Google ScholarDigital Library
- S. Singh and A. Awekar. Incremental shared nearest neighbor density-based clustering. In CIKM, pages 1533--1536, 2013. Google ScholarDigital Library
- P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Pearson, 2006. Google ScholarDigital Library
- R. E. Tarjan. Efficiency of a good but not linear set union algorithm. JACM, 22(2):215--225, 1975. Google ScholarDigital Library
Index Terms
- Dynamic Density Based Clustering
Recommendations
Integration of particle swarm optimization and genetic algorithm for dynamic clustering
Although the algorithms for cluster analysis are continually improving, most clustering algorithms still need to set the number of clusters. Thus, this study proposes a novel dynamic clustering approach based on particle swarm optimization (PSO) and ...
Density-based semi-supervised clustering
Semi-supervised clustering methods guide the data partitioning and grouping process by exploiting background knowledge, among else in the form of constraints. In this study, we propose a semi-supervised density-based clustering method. Density-based ...
Dynamic clustering using combinatorial particle swarm optimization
Combinatorial Particle Swarm Optimization (CPSO) is a relatively recent technique for solving combinatorial optimization problems. CPSO has been used in different applications, e.g., partitional clustering and project scheduling problems, and it has ...
Comments