|
ABSTRACT
We consider the problem of identifying correlations in data streams. Surprisingly, our work seems to be the first to consider this natural problem. In the centralized model, we consider a stream of pairs (i,j) ∈ [n]2 whose frequencies define a joint distribution (X,Y). In the distributed model, each coordinate of the pair may appear separately in the stream. We present a range of algorithms for approximating to what extent X and Y are independent, i.e., how close the joint distribution is to the product of the marginals. We consider various measures of closeness including ℓ1, ℓ2, and the mutual information between X and Y. Our algorithms are based on "sketching sketches", i.e., composing small-space linear synopses of the distributions. Perhaps ironically, the biggest technical challenges that arise relate to ensuring that different components of our estimates are sufficiently independent.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
Noga Alon , Alexandr Andoni , Tali Kaufman , Kevin Matulef , Ronitt Rubinfeld , Ning Xie, Testing k-wise and almost k-wise independence, Proceedings of the thirty-ninth annual ACM symposium on Theory of computing, June 11-13, 2007, San Diego, California, USA
[doi> 10.1145/1250790.1250863]
|
| |
2
|
|
| |
3
|
Rohit Ananthakrishna , Abhinandan Das , Johannes Gehrke , Flip Korn , S. Muthukrishnan , Divesh Srivastava, Efficient Approximation of Correlated Sums on Data Streams, IEEE Transactions on Knowledge and Data Engineering, v.15 n.3, p.569-572, March 2003
[doi> 10.1109/TKDE.2003.1198391]
|
 |
4
|
Brian Babcock , Shivnath Babu , Mayur Datar , Rajeev Motwani , Jennifer Widom, Models and issues in data stream systems, Proceedings of the twenty-first ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, June 03-05, 2002, Madison, Wisconsin
[doi> 10.1145/543613.543615]
|
| |
5
|
Ziv Bar-Yossef , Ravi Kumar , D. Sivakumar, Reductions in streaming algorithms, with an application to counting triangles in graphs, Proceedings of the thirteenth annual ACM-SIAM symposium on Discrete algorithms, p.623-632, January 06-08, 2002, San Francisco, California
|
| |
6
|
T. Batu , L. Fortnow , E. Fischer , R. Kumar , R. Rubinfeld , P. White, Testing Random Variables for Independence and Identity, Proceedings of the 42nd IEEE symposium on Foundations of Computer Science, p.442, October 14-17, 2001
|
 |
7
|
Lakshminath Bhuvanagiri , Sumit Ganguly , Deepanjan Kesh , Chandan Saha, Simpler algorithm for estimating frequency moments of data streams, Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, p.708-713, January 22-26, 2006, Miami, Florida
[doi> 10.1145/1109557.1109634]
|
| |
8
|
|
| |
9
|
A. Chakrabarti, S. Khot, and X. Sun, Near-optimal lower bounds on the multi-party communication complexity of set disjointness., in IEEE Conference on Computational Complexity, 2003, pp. 107--117.
|
 |
10
|
|
| |
11
|
|
 |
12
|
|
| |
13
|
|
 |
14
|
Graham Cormode , Flip Korn , S. Muthukrishnan , Divesh Srivastava, Space- and time-efficient deterministic algorithms for biased quantiles over data streams, Proceedings of the twenty-fifth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, June 26-28, 2006, Chicago, IL, USA
[doi> 10.1145/1142351.1142389]
|
| |
15
|
|
| |
16
|
Joan Feigenbaum , Sampath Kannan , Andrew McGregor , Siddharth Suri , Jian Zhang, Graph distances in the streaming model: the value of space, Proceedings of the sixteenth annual ACM-SIAM symposium on Discrete algorithms, January 23-25, 2005, Vancouver, British Columbia
|
| |
17
|
|
 |
18
|
Johannes Gehrke , Flip Korn , Divesh Srivastava, On computing correlated aggregates over continual data streams, Proceedings of the 2001 ACM SIGMOD international conference on Management of data, p.13-24, May 21-24, 2001, Santa Barbara, California, United States
|
 |
19
|
|
| |
20
|
S. Guha, P. Indyk, and A. McGregor, Sketching information divergences, in Conference on Learning Theory, 2007, pp. 424--438.
|
 |
21
|
|
| |
22
|
S. Guha and A. McGregor, Lower bounds for quantile estimation in random-order and multipass streaming, in International Colloquium on Automata, Languages and Programming, 2007, pp. 704--715.
|
| |
23
|
S. Guha and A. McGregor, Space-efficient sampling, in AISTATS, 2007, pp. 169--176.
|
 |
24
|
|
 |
25
|
|
| |
26
|
T. S. Jayram, R. Kumar, and D. Sivakumar, Simple lower bound on one-way Gap-Hamming., in www.cse.iitk.ac.in/users/sganguly/slides/ravikumar.pdf, 2007.
|
| |
27
|
|
| |
28
|
|
| |
29
|
|
|