iVIBRATE: Interactive visualization-based framework for clustering large datasets

Published: 01 April 2006

Abstract

With continued advances in communication network technology and sensing technology, the amount of data produced and made available through cyberspace is growing at an astounding rate. Efficient, high-quality clustering of large datasets remains one of the most important problems in large-scale data analysis. A commonly used methodology for cluster analysis on large datasets is the three-phase framework of sampling/summarization, iterative cluster analysis, and disk-labeling. This framework has three known problems that demand effective solutions. The first problem is how to effectively define and validate irregularly shaped clusters, especially in large datasets; automated algorithms and statistical methods are typically not effective in handling such clusters. The second problem is how to label the entire dataset on disk (disk-labeling) without introducing additional errors, which includes dealing with outliers, irregular clusters, and cluster boundary extension. The third problem is the lack of research on how to integrate the three phases effectively. In this article, we describe iVIBRATE, an interactive visualization-based three-phase framework for clustering large datasets. The two main components of iVIBRATE are its VISTA visual cluster-rendering subsystem, which brings human interaction into the large-scale iterative clustering process through interactive visualization, and its adaptive ClusterMap labeling subsystem, which offers visualization-guided disk-labeling solutions that are effective in dealing with outliers, irregular clusters, and cluster boundary extension.
Another important contribution of iVIBRATE is the identification of the special issues that arise in integrating the two components and the sampling approach into a coherent framework, together with solutions for improving the reliability of the framework and for minimizing the number of errors generated within the cluster analysis process. We study the effectiveness of the iVIBRATE framework through a walkthrough example on a dataset of one million records, and we experimentally evaluate the iVIBRATE approach using both real-life and synthetic datasets. Our results show that iVIBRATE can efficiently involve the user in the clustering process and generate high-quality clustering results for large datasets.
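
To make the three-phase framework named in the abstract concrete, the following is a minimal Python sketch of a generic, fully automated version of it: sample, cluster the sample, then label every record on disk. It is not the iVIBRATE implementation. The choice of reservoir sampling for phase 1, plain k-means as a stand-in for the iterative cluster analysis in phase 2, nearest-centroid assignment with a distance cutoff as a stand-in for disk-labeling in phase 3, and all function and parameter names (`reservoir_sample`, `kmeans`, `label_stream`, `outlier_radius`) are assumptions made for illustration.

```python
import random


def reservoir_sample(stream, k, rng):
    """Phase 1 (assumed technique): uniform sample of k records in one pass."""
    sample = []
    for i, rec in enumerate(stream):
        if i < k:
            sample.append(rec)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = rec
    return sample


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, rng, iters=20):
    """Phase 2 stand-in: plain k-means on the in-memory sample."""
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            groups[idx].append(p)
        for c, g in enumerate(groups):
            if g:  # keep the old centroid if a group is empty
                centroids[c] = tuple(sum(col) / len(g) for col in zip(*g))
    return centroids


def label_stream(stream, centroids, outlier_radius):
    """Phase 3 stand-in: label each record by its nearest centroid.

    Records farther than outlier_radius from every centroid get label -1.
    """
    limit = outlier_radius ** 2
    for rec in stream:
        idx = min(range(len(centroids)), key=lambda c: sq_dist(rec, centroids[c]))
        yield idx if sq_dist(rec, centroids[idx]) <= limit else -1
```

A caller would pass an iterator over the on-disk records to `reservoir_sample`, run `kmeans` on the result, and then make a second pass over the records with `label_stream`. Nearest-centroid labeling implicitly assumes compact, roughly spherical clusters, which is the kind of assumption that breaks down for the irregularly shaped clusters, outliers, and boundary extension that the abstract identifies as the motivation for visualization-guided rendering and labeling.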

Published In

ACM Transactions on Information Systems, Volume 24, Issue 2
April 2006
150 pages
ISSN: 1046-8188
EISSN: 1558-2868
DOI: 10.1145/1148020

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. clustering
  2. interactive visualization
  3. labeling
  4. large datasets
  5. performance
