|
ABSTRACT
Data mining in large data sets often requires a sampling or summarization step to form an in-core representation of the data that can be processed more efficiently. Uniform random sampling is frequently used in practice and also frequently criticized because it will miss small clusters. Many natural phenomena are known to follow Zipf's distribution and the inability of uniform sampling to find small clusters is of practical concern. Density Biased Sampling is proposed to probabilistically under-sample dense regions and over-sample light regions. A weighted sample is used to preserve the densities of the original data. Density biased sampling naturally includes uniform sampling as a special case. A memory efficient algorithm is proposed that approximates density biased sampling using only a single scan of the data. We empirically evaluate density biased sampling using synthetic data sets that exhibit varying cluster size distributions finding up to a factor of six improvement over uniform sampling.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
Alfred V. Aho , Ravi Sethi , Jeffrey D. Ullman, Compilers: principles, techniques, and tools, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, 1986
|
| |
2
|
P.S Bradley, Usama Fayyad, and Cory Reina. Scaling clustering algorithms to large databases. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD- 98), pages 9-15, New York City, New York, August 1998. AAAI Press.
|
| |
3
|
P.S Bradley, Usama Fayyad, and Cory Reina. Scaling EM (expectation maximization) clustering to large databases. Technical Report MSR-TR-98- 35, Microsoft Research, Redmond, WA, November, 1998.
|
| |
4
|
Chris Buckley, Mandar Mitra, Janet Walz, and Clarie Cardie. Using clustering and superconcepts within SMART: TREC 6. In Sixth Text REtrieval Conference (TREC-6), Gaithersburg, Maryland, November 1997. National Institute of Standards and Technology (NIST), United States Department of Commerce.
|
| |
5
|
|
| |
6
|
|
| |
7
|
Usama M. Fayyad, Cory A. Reina, and Paul S. Bradley. Initialization of iterative refinement clustering algorithms. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD-98), pages 194-198, New York City, New York, August 1998. AAAI Press.
|
 |
8
|
|
 |
9
|
|
| |
10
|
|
| |
11
|
|
 |
12
|
Joseph M. Hellerstein , Peter J. Haas , Helen J. Wang, Online aggregation, Proceedings of the 1997 ACM SIGMOD international conference on Management of data, p.171-182, May 11-15, 1997, Tucson, Arizona, United States
|
| |
13
|
Bernardo A. Huberman, Peter L. T. Pirolli, James E. Pitkow, and Rajan M. Lukose. Strong regularities in world wide web surfing. Science, 280(5360):95-97, April 3 1998.
|
| |
14
|
Najmeh Joze-Hkajavi and Kenneth Salem. Twophase clustering of large datasets. Technical Report CS-98-27, Department of Computer Science, University of Waterloo, November 1998.
|
| |
15
|
Jeremy Kepner, Xiaohui Fan, Neta Buhcall, James Gunn, Robert Lupton, and Ghohung Xu. An automated cluster finder: the adaptive matched filter. The Astrophysics Journal, 517, 1999.
|
| |
16
|
|
 |
17
|
|
| |
18
|
|
| |
19
|
Brian D. Ripley. Spatial Statistics. John Wiley & Sons, 1981.
|
| |
20
|
Manfred Schroeder. Fractals, Chaos, Power Laws: Minutes from an Infinite Paradise. W.H. Freeman and Company, New York, 1991.
|
 |
21
|
|
 |
22
|
|
 |
23
|
|
| |
24
|
G.K. Zipf. Human Behavior and Principle of Least Effort: An Introduction to Human Ecology. Addison Wesley, Cambridge, Massachusetts, 1949.
|
CITED BY 21
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Brain Babcock , Mayur Datar , Rajeev Motwani , Liadan O'Callaghan, Maintaining variance and k-medians over data stream windows, Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, p.234-243, June 09-11, 2003, San Diego, California
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|