ABSTRACT
Heavy hitters, which are items occurring with frequency above a given threshold, are an important aggregation and summary tool when processing data streams or data warehouses. Hierarchical heavy hitters (HHHs) have been introduced as a natural generalization for hierarchical data domains, including multi-dimensional data. An item x in a hierarchy is called a ϕ-HHH if its frequency after discounting the frequencies of all its descendant hierarchical heavy hitters exceeds ϕn, where ϕ is a user-specified parameter and n is the size of the data set. Recently, single-pass schemes have been proposed for computing ϕ-HHHs using space roughly O(1/ϕ log(ϕn)). The frequency estimates of these algorithms, however, hold only for the total frequencies of items, and not the discounted frequencies; this leads to false positives because the discounted frequency can be significantly smaller than the total frequency. This paper attempts to explain the difficulty of finding hierarchical heavy hitters with better accuracy. We show that a single-pass deterministic scheme that computes ϕ-HHHs in a d-dimensional hierarchy with any approximation guarantee must use Ω(1/ϕd+1) space. This bound is tight: in fact, we present a data stream algorithm that can report the ϕ-HHHs without false positives in O(1/ϕd+1) space.
- N. Alon, Y. Matias, M. Szegedy. The space complexity of approximating the frequency moments. J. Comput. Syst. Sci.58 (1999), 137--147. Google ScholarDigital Library
- A. Arasu and G. Manku. Approximate counts and quantiles over sliding windows. In Proc. 23rd PODS, 2004, ACM Press, pp. 286--296. Google ScholarDigital Library
- B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In 21st PODS, 2002, ACM Press, pp. 1--16. Google ScholarDigital Library
- M. Charikar, K. Chen, and M. Farach-Colton. Finding frequent items in data streams. In 29th Proc. ICALP, LNCS, Springer-Verlag, 2002, pp. 693--703. Google ScholarDigital Library
- G. Cormode, F. Korn, S. Muthukrishnan, and D. Srivastava. Finding hierarchical heavy hitters in data streams. In Proc. 29th Conf. on Very Large Data Bases, 2003. Google ScholarDigital Library
- G. Cormode, F. Korn, S. Muthukrishnan, and D. Srivastava. Diamond in the rough: Finding hierarchical heavy hitters in multi-dimensional data. In Proc. ACM SIGMOD, 2004. Google ScholarDigital Library
- E. D. Demaine, A. López-Ortiz, and J. I. Munro. Frequency estimation of internet packet streams with limited space. In Proc. 10th European Sympos. Algorithms, LNCS 2461, 2002, pp. 348--360. Google ScholarDigital Library
- C. Estan, S. Savage, and G. Varghese. Automatically inferring patterns of resource consumption in network traffic. In Proc. of ACM SIGCOMM, 2003. Google ScholarDigital Library
- C. Estan and G. Varghese. New directions in traffic measurement and accounting. In Proc. 1st ACM SIGCOMM Workshop on Internet Measurement, 2001, pp. 75--80. Google ScholarDigital Library
- R. M. Karp, S. Shenker, and C. H. Papadimitriou. A simple algorithm for finding frequent elements in streams and bags. ACM Transactions on Database Systems28 (1) (2003), 51--55. Google ScholarDigital Library
- G. Manku and R. Motwani. Approximate frequency counts over data streams. In Proc. 28th Conf. Very Large Data Bases, 2002, pp. 346--357. Google ScholarDigital Library
- J. Misra and D. Gries. Finding repeated elements. Sci. Comput. Programming2 (1982), 143--152.Google ScholarDigital Library
- S. Muthukrishnan. Data streams: Algorithms and applications. Preprint, 2003.Google Scholar
Recommendations
Finding hierarchical heavy hitters in streaming data
Data items that arrive online as streams typically have attributes which take values from one or more hierarchies (time and geographic location, source and destination IP addresses, etc.). Providing an aggregate view of such data is important for ...
Beating CountSketch for heavy hitters in insertion streams
STOC '16: Proceedings of the forty-eighth annual ACM symposium on Theory of ComputingGiven a stream p1, …, pm of items from a universe U, which, without loss of generality we identify with the set of integers {1, 2, …, n}, we consider the problem of returning all ℓ2-heavy hitters, i.e., those items j for which fj ≥ є √F2, where fj is ...
Identifying correlated heavy-hitters in a two-dimensional data stream
We consider online mining of correlated heavy-hitters (CHH) from a data stream. Given a stream of two-dimensional data, a correlated aggregate query first extracts a substream by applying a predicate along a primary dimension, and then computes an ...
Comments