|
ABSTRACT
The clustering problem concerns the discovery of homogeneous groups of data according to a certain similarity measure. Clustering suffers from the curse of dimensionality. It is not meaningful to look for clusters in high dimensional spaces as the average density of points anywhere in input space is likely to be low. As a consequence, distance functions that equally use all input features may be ineffective. We introduce an algorithm that discovers clusters in subspaces spanned by different combinations of dimensions via local weightings of features. This approach avoids the risk of loss of information encountered in global dimensionality reduction techniques. Our method associates to each cluster a weight vector, whose values capture the relevance of features within the corresponding cluster. In this paper we present an efficient SQL implementation of our algorithm, that enables the discovery of clusters on data residing inside a relational DBMS.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
Charu C. Aggarwal , Joel L. Wolf , Philip S. Yu , Cecilia Procopiuc , Jong Soo Park, Fast algorithms for projected clustering, Proceedings of the 1999 ACM SIGMOD international conference on Management of data, p.61-72, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
|
| |
2
|
P. Arabie and L. Hubert. An overview of combinatorial data analysis. Clustering and Classification. World Scientific Pub., pages 5--63, 1996.
|
| |
3
|
|
| |
4
|
|
| |
5
|
P. Cheeseman and J. Stutz. Bayessian Classification (autoclass): Theory and Results, chapter 6. AAAI/MIT Press, 1996.
|
| |
6
|
|
| |
7
|
A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, 39(1):1--38, 1997.
|
| |
8
|
|
| |
9
|
|
| |
10
|
M. Ester, H. P. Kriegel, and X. Xu. A database interface for clustering in large spatial databases. In Proc. KDD, 1995.
|
| |
11
|
Z. Ghahramani and G. E. Hinton. The EM Algorithm for Mixtures of Factor Analyzers. Technical Report CRG-TR-96-1, Dept. of Computer Science, Univ. of Toronto, 1996.
|
 |
12
|
Eamonn Keogh , Kaushik Chakrabarti , Michael Pazzani , Sharad Mehrotra, Locally adaptive dimensionality reduction for indexing large time series databases, Proceedings of the 2001 ACM SIGMOD international conference on Management of data, p.151-162, May 21-24, 2001, Santa Barbara, California, United States
|
| |
13
|
R. Michalski and R. Stepp. Machine Learning: An Artificial Intelligence Approach, chapter 'Learning from observation: Conceptual Clustering'. IOGA Publishing Co., 1983.
|
| |
14
|
|
 |
15
|
|
 |
16
|
Rakesh Agrawal , Johannes Gehrke , Dimitrios Gunopulos , Prabhakar Raghavan, Automatic subspace clustering of high dimensional data for data mining applications, Proceedings of the 1998 ACM SIGMOD international conference on Management of data, p.94-105, June 01-04, 1998, Seattle, Washington, United States
|
 |
17
|
Alexander Thomasian , Vittorio Castelli , Chung-Sheng Li, Clustering and singular value decomposition for approximate indexing in high dimensional spaces, Proceedings of the seventh international conference on Information and knowledge management, p.201-207, November 02-07, 1998, Bethesda, Maryland, United States
[doi> 10.1145/288627.288658]
|
| |
18
|
|
 |
19
|
|
 |
20
|
Tian Zhang , Raghu Ramakrishnan , Miron Livny, BIRCH: an efficient data clustering method for very large databases, Proceedings of the 1996 ACM SIGMOD international conference on Management of data, p.103-114, June 04-06, 1996, Montreal, Quebec, Canada
|
|