|
ABSTRACT
Computing statistical information on probabilistic data has attracted a lot of attention recently, as the data generated from a wide range of data sources are inherently fuzzy or uncertain. In this paper, we study an important statistical query on probabilistic data: finding the frequent items. One straightforward approach to identify the frequent items in a probabilistic data set is to simply compute the expected frequency of an item and decide if it exceeds a certain fraction of the expected size of the whole data set. However, this simple definition misses important information about the internal structure of the probabilistic data and the interplay among all the uncertain entities. Thus, we propose a new definition based on the possible world semantics that has been widely adopted for many query types in uncertain data management, trying to find all the items that are likely to be frequent in a randomly generated possible world. Our approach naturally leads to the study of ranking frequent items based on confidence as well. Finding likely frequent items in probabilistic data turns out to be much more difficult. We first propose exact algorithms for offline data with either quadratic or cubic time. Next, we design novel sampling-based algorithms for streaming data to find all approximately likely frequent items with theoretically guaranteed high probability and accuracy. Our sampling schemes consume sublinear memory and exhibit excellent scalability. Finally, we verify the effectiveness and efficiency of our algorithms using both real and synthetic data sets with extensive experimental evaluations.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
C. Aggarwal. On density based transforms for uncertain data mining. In ICDE, 2007.
|
| |
2
|
P. Agrawal, O. Benjelloun, A. Das Sarma, C. Hayworth, S. Nabar, T. Sugihara, and J. Widom. Trio: A system for data, uncertainty, and lineage. In VLDB, 2006.
|
| |
3
|
R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB, 1994.
|
| |
4
|
N. Alon, Y. Matias, and M. Szegedy. The Space Complexity of Approximating the Frequency Moments. Journal of Computer and System Sciences, 58(1):137--147, 1999.
|
| |
5
|
L. Antova, C. Koch, and D. Olteanu. 10106 worlds and beyond: Efficient representation and processing of incomplete information. In ICDE, 2007.
|
| |
6
|
L. Antova, C. Koch, and D. Olteanu. From complete to incomplete information and back. In SIGMOD, 2007.
|
| |
7
|
O. Benjelloun, A. D. Sarma, A. Halevy, and J. Widom. ULDBs: databases with uncertainty and lineage. In VLDB, 2006.
|
| |
8
|
S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani. Robust and efficient fuzzy match for online data cleaning. In SIGMOD, 2003.
|
| |
9
|
R. Cheng, D. Kalashnikov, and S. Prabhakar. Evaluating probabilistic queries over imprecise data. In SIGMOD, 2003.
|
| |
10
|
G. Cormode and M. Garofalakis. Sketching probabilistic data streams. In SIGMOD, 2007.
|
| |
11
|
G. Cormode, F. Korn, S. Muthukrishnan, and D. Srivastava. Diamond in the rough: finding hierarchical heavy hitters in multi-dimensional data. In SIGMOD, 2004.
|
| |
12
|
G. Cormode and S. Muthukrishnan. What?s hot and what?s not: tracking most frequent items dynamically. In PODS, 2003.
|
| |
13
|
N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases. In VLDB, 2004.
|
| |
14
|
N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases. VLDB Journal, 16(4):523--544, 2007.
|
| |
15
|
N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases. VLDB Journal, 16(4):523--544, 2007.
|
| |
16
|
A. Deshpande, C. Guestrin, S. Madden, J. Hellerstein, and W. Hong. Model-driven data acquisition in sensor networks. In VLDB, 2004.
|
| |
17
|
C. Estan and G. Varghese. New directions in traffic measurement and accounting. In SIGCOMM, 2002.
|
| |
18
|
M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman. Computing iceberg queries efficiently. In VLDB, 1998.
|
| |
19
|
A. Fuxman, E. Fazli, and R. J. Miller. ConQuer: efficient management of inconsistent databases. In SIGMOD, 2005.
|
| |
20
|
P. B. Gibbons and Y. Matias. New sampling-based summary statistics for improving approximate query answers. In SIGMOD, 1998.
|
| |
21
|
A. Halevy, A. Rajaraman, and J. Ordille. Data integration: the teenage year. In VLDB, 2006.
|
| |
22
|
J. Han, J. Pei, G. Dong, and K. Wang. Efficient computation of iceberg cubes with complex measures. In SIGMOD, 2001.
|
| |
23
|
M. A. Hernandez and S. J. Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery, 2(1):9--37, 1998.
|
| |
24
|
T. S. Jayram, A. McGregor, S. Muthukrishnan, and E. Vee. Estimating statistical aggregates on probabilistic data streams. In PODS, 2007.
|
| |
25
|
C. Jin, W. Qian, C. Sha, J. X. Yu, and A. Zhou. Dynamically maintaining frequent items over a data stream. In CIKM, 2003.
|
| |
26
|
B. Kanagal and A. Deshpande. Online filtering, smoothing and probabilistic modeling of streaming data. In ICDE, 2008.
|
| |
27
|
R. M. Karp, S. Shenker, and C. H. Papadimitriou. A simple algorithm for finding frequent elements in streams and bags. ACM Trans. Database Syst., 28(1), 2003.
|
| |
28
|
L. K. Lee and H. F. Ting. A simpler and more efficient deterministic scheme for finding frequent items over sliding windows. In PODS, 2006.
|
| |
29
|
V. Ljosa and A. Singh. APLA: Indexing arbitrary probability distributions. In ICDE, 2007.
|
| |
30
|
A. Manjhi, V. Shkapenyuk, K. Dhamdhere, and C. Olston. Finding (recently) frequent items in distributed data streams. In ICDE, 2005.
|
| |
31
|
G. S. Manku and R. Motwani. Approximate frequency counts over data streams. In VLDB, 2002.
|
| |
32
|
A. Metwally, D. Agrawal, and A. E. Abbadi. An integrated efficient solution for computing frequent and top-k elements in data streams. ACM Trans. Database Syst., 31(3):1095--1133, 2006.
|
| |
33
|
J. Pei, B. Jiang, X. Lin, and Y. Yuan. Probabilistic skylines on uncertain data. In VLDB, 2007.
|
| |
34
|
C. Re, N. Dalvi, and D. Suciu. Efficient top-k query evaluation on probalistic databases. In ICDE, 2007.
|
| |
35
|
C. Re and D. Suciu. Materialized views in probabilistic databases for information exchange and query optimization. In VLDB, 2007.
|
| |
36
|
A. D. Sarma, O. Benjelloun, A. Halevy, and J. Widom. Working models for uncertain data. In ICDE, 2006.
|
| |
37
|
P. Sen and A. Deshpande. Representing and querying correlated tuples in probabilistic databases. In ICDE, 2007.
|
| |
38
|
S. Singh, C. Mayfield, S. Prabhakar, R. Shah, and S. Hambrusch. Indexing uncertain categorical data. In ICDE, 2007.
|
| |
39
|
M. A. Soliman, I. F. Ilyas, and K. C. Chang. Top-k query processing in uncertain databases. In ICDE, 2007.
|
| |
40
|
Y. Tao, R. Cheng, X. Xiao, W. K. Ngai, B. Kao, and S. Prabhakar. Indexing multi-dimensional uncertain data with arbitrary probability density functions. In VLDB, 2005.
|
|