| Query-aware partitioning for monitoring massive network data streams |
| Full text |
Pdf
(429 KB)
|
Source
|
International Conference on Management of Data
archive
Proceedings of the 2008 ACM SIGMOD international conference on Management of data
table of contents
Vancouver, Canada
SESSION: Industrial Session 3: Streams, Conversations and Verification:
table of contents
Pages 1135-1146
Year of Publication: 2008
ISBN:978-1-60558-102-6
|
|
Authors
|
|
Theodore Johnson
|
AT&T Labs - Research, Florham Park, NJ, USA
|
|
Muthu S. Muthukrishnan
|
Rutgers University, Piscataway, NJ, USA
|
|
Vladislav Shkapenyuk
|
AT&T Labs - Research, Florham Park, NJ, USA
|
|
Oliver Spatscheck
|
AT&T Labs - Research, Florham Park, NJ, USA
|
|
| Sponsors |
|
| Publisher |
|
| Bibliometrics |
Downloads (6 Weeks): 45, Downloads (12 Months): 155, Citation Count: 0
|
|
|
ABSTRACT
Data Stream Management Systems (DSMS) are gaining acceptance for applications that need to process very large volumes of data in real time. The load generated by such applications frequently exceeds by far the computation capabilities of a single centralized server. In particular, a single-server instance of our DSMS, Gigascope, cannot keep up with the processing demands of the new OC-786 networks, which can generate more than 100 million packets per second. In this paper, we explore a mechanism for the distributed processing of very high speed data streams. Existing distributed DSMSs employ two mechanisms for distributing the load across the participating machines: partitioning of the query execution plans and partitioning of the input data stream in a query-independent fashion. However, for a large class of queries, both approaches fail to reduce the load as compared to centralized system, and can even lead to an increase in the load. In this paper we present an alternative approach - query-aware data stream partitioning that allows for more efficient scaling. We present methods for analyzing any given query set and choose the optimal partitioning scheme, and show how to reconcile potentially conflicting requirements that different queries might place on partitioning. We conclude with experiments on a small cluster of processing nodes on high-rate network traffic feed that demonstrates with different query sets that our methods effectively distribute the load across all processing nodes and facilitate efficient scaling whenever more processing nodes becomes available.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
D. J. Abadi et al. The Design of the Borealis Stream Processing Engine, CIDR 2005.
|
| |
2
|
D. J. Abadi et al.. Aurora: A new model and architecture for data stream management. VLDB Journal, 12(2):120--139, 2003.
|
| |
3
|
D. J. Abadi W. Lindner, S. Madden, and J. Schuler. An Integration Framework for Sensor Networks and Data Stream Management Systems. Demonstration. VLDB 2004
|
| |
4
|
A. Arasu et al. STREAM: The Stanford stream data manager. IEEE Data Engineering Bulletin, 26(1):19--26, 2003.
|
| |
5
|
B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In Proc. ACM PODS, pages 1--16, 2002.
|
| |
6
|
M. Balazinska, H. Balakrishnan, M. Stonebraker. Contract-Based Load Management in Federated Distributed Systems. NSDI 2004, San Francisco, CA, March 2004.
|
| |
7
|
S. Chandrasekaran et al. TelegraphCQ: Continuous dataflow processing for an uncertain world. CIDR 2003.
|
| |
8
|
J. Chen, D.J. DeWitt, F. Tian and Y. Wang, NiagaraCQ: A Scalable Continuous Query System for Internet Databases. SIGMOD 2000 pg. 379--390.
|
| |
9
|
M. Cherniack, H. Balakrishnan, M. Balazinska, D. Carney, U. Cetintemel, Y. Xing, and S. Zdonik. Scalable Distributed Stream Processing. CIDR 2003.
|
| |
10
|
G. Cormode, T. Johnson, F. Korn, S. Muthukrishnan, O. Spatscheck, and D. Srivastava. Holistic UDAFs at streaming speeds. SIGMOD Conference, pages 35--46. ACM, 2004.
|
| |
11
|
C. Cranor, T. Johnson, O. Spatscheck, and V. Shkapenyuk. Gigascope: A stream database for network applications. ACM SIGMOD, pages 647--651, 2003.
|
| |
12
|
D. DeWitt, J. Gray. Parallel database systems: the future of high performance database systems. Communications of the ACM, v.35 n.6, p.85--98, June 1992.
|
| |
13
|
M. Ivanova and T. Risch. Customizable Parallel Execution of Scientific Stream Queries. VLDB 2005.
|
| |
14
|
R. R. Kompella, S. Singh, and G. Varghese. On scalable attack detection in the network. In ACM Internet Measurement Conference IMC 2004, pages 187 -- 200.
|
| |
15
|
D. Kossmann. The state of the art in distributed query processing. ACM Computing Surveys, 32(4):422--469, 2000.
|
| |
16
|
Per-Ake Larson. Data Reduction by Partial Preaggregation. 18th International Conference on Data Engineering (ICDE'02), 2002.
|
| |
17
|
J. Li, D. Maier, K. Tufte, V. Papadimos, P. A. Tucker: No pane, no gain: efficient evaluation of sliding-window aggregates over data streams. SIGMOD Record 34(1): 39--44 (2005)
|
| |
18
|
S. Muthukrishnan. Data Streams: Algorithms and Applications. Foundations and Trends in Theoretical Computer Science, Vol 2, 2005.
|
| |
19
|
J. Rao, C. Zhang, N. Megiddo, G. M. Lohman: Automating physical database design in a parallel database. SIGMOD Conference 2002: 558--569
|
| |
20
|
M. A. Shah, J. M. Hellerstein, S. Chandrasekaran, M. J. Franklin. Flux: An Adaptive Partitioning Operator for Continuous Query Systems. ICDE 2003
|
| |
21
|
M. Sullivan and A. Heybey. Tribeca: A system for managing large databases of network traffic. In Proc.USENIX Annual Technical Conf., 1998
|
|