research-article

Cost-Effective Stream Join Algorithm on Cloud System

Authors:
Junhua Fang

East China Normal University, Shanghai, China

East China Normal University, Shanghai, China
View Profile

,
Rong Zhang

East China Normal University, Shanghai, China

East China Normal University, Shanghai, China
View Profile

,
Xiaotong Wang

East China Normal University, Shanghai, China

East China Normal University, Shanghai, China
View Profile

,
Tom Z.J. Fu

Advanced Digital Sciences Center, Singapore, Singapore

Advanced Digital Sciences Center, Singapore, Singapore
View Profile

,
Zhenjie Zhang

Advanced Digital Sciences Center, Singapore, Singapore

Advanced Digital Sciences Center, Singapore, Singapore
View Profile

,
Aoying Zhou

East China Normal University, Shanghai, China

East China Normal University, Shanghai, China
View Profile

CIKM '16: Proceedings of the 25th ACM International on Conference on Information and Knowledge ManagementOctober 2016Pages 1773–1782https://doi.org/10.1145/2983323.2983773

Published:24 October 2016Publication History

CIKM '16: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management

Pages 1773–1782

ABSTRACT

Matrix-based scheme (Join-Matrix) can prefectly support distributed stream joins, especially for arbitrary join predicates, because it guarantees any tuples from two streams to meet with each other. However,the dynamics and unpredictability features of stream require quick actions on scheme changing. Otherwise, they may lead to degradation of system throughputs and increament of processing latency with the waste of system resources, such as CPUs and Memories. Since Join-Matrix model has the fixed processing architecture with replicated data, these kinds of adverseness will be magnified. Therefore, it is urgent to find a solution that preserves advantages of Join-Matrix model and promises a good usage to computation resources when it meets scheme changing. In this paper, we propose a cost-effective stream join algorithm, which ensures the adaptability of Join-Matrix but with lower resources consumption. Specifically, a varietal matrix generation algorithm is proposed to generate an irregular matrix scheme for assigning the minimal number of tasks; a lightweight migration algorithm is designed to ensure state migration at a low cost; a complete load balance process framework is described to guarantee the correctness during the scheme changing. We conduct extensive experiments to compare our method with baseline systems on both benchmarks and real-workloads, and explain the results in detail.

References

Apache Storm. http://storm.apache.org/.Google Scholar
The TPC-H Benchmark. http://www.tpc.org/tpch.Google Scholar
R. Ananthanarayanan, V. Basker, S. Das, and et al. Photon: fault-tolerant and scalable joining of continuous data streams. In SIGMOD, pages 577--588, 2013. Google ScholarDigital Library
G. Cormode and S. Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1):58--75, 2005. Google ScholarDigital Library
J.-P. Dittrich, B. Seeger, and et al. Progressive merge join: A genetic and non-blocking sort-based join algorithm. In VLDB, pages 299--310, 2002. Google ScholarDigital Library
M. Elseidy, A. Elguindy, A.and Vitorovic, and C. Koch. Scalable and adaptive online joins. In VLDB, pages 441--452, 2014. Google ScholarDigital Library
R. S. Epstein, M. Stonebraker, and E. Wong. Distributed query processing in a relational data base system. In SIGMOD, pages 169--180, 1978. Google ScholarDigital Library
J. Fang, R. Zhang, X. Wang, and Z. A. Flexible and adaptive stream join algorithm. In APWeb, Accepted, 2016.Google ScholarCross Ref
B. Gedik. Partitioning functions for stateful data parallelism in stream processing. VLDBJ, 23(4):517--539, 2014. Google ScholarDigital Library
R. Huebsch, M. Garofalakis, J. Hellerstein, and I. Stoica. Advanced join strategies for large-scale distributed computation. In VLDB, pages 1484--1495, 2014. Google ScholarDigital Library
Z. Ives, D. Florescu, M. Friedman, A. Levy, and D. Weld. An adaptive query execution system for data integration. In SIGMOD, pages 299--310, 1999. Google ScholarDigital Library
Y. Kwon, M. Balazinska, and et al. Skewtune: mitigating skew in mapreduce applications. In SIGMOD, pages 25--36, 2012. Google ScholarDigital Library
Q. Lin, B. C. Ooi, Z. Wang, and C. Yu. Scalable distributed stream join processing. In SIGMOD, pages 811--825, 2015. Google ScholarDigital Library
B. Liu, Y. Zhu, M. Jbantova, and et al. A dynamically adaptive distributed system for processing complex continuous queries. In VLDB, pages 1338--1341, 2005. Google ScholarDigital Library
M. Mokbel, M. Lu, and W. Aref. Hash-merge join: Non-blocking join algorithm for producing fast and early join results. In ICDE, pages 251--263, 2004. Google ScholarDigital Library
M. A. U. Nasir, M. Serafini, and et al. When two choices are not enough: Balancing at scale in distributed stream processing. In ICDE, 2016.Google ScholarCross Ref
A. Okcan and M. Riedewald. Processing theta-joins using mapreduce. In SIGMOD, pages 949--960, 2011. Google ScholarDigital Library
Z. Qian, Y. He, C. Su, and et al. Timestream: reliable stream computation in the cloud. In Eurosys, pages 1--14, 2013. Google ScholarDigital Library
J. W. Stamos and H. C. Young. A symmetric and replicate algorithm for distributed joins. IEEE Transactions on Parallel and Distributed Systems, 4(12):1345--1354, 1993. Google ScholarDigital Library
N. ufler, B.and Augsten, A. Reiser, and A. Kemper. Load balancing in mapreduce based on scalable cardinality estimates. In ICDE, pages 522--533, 2012. Google ScholarDigital Library
T. Urhan and M. Franklin. Dynamic pipeline scheduling for improving interactive query performance. In VLDB, pages 501--510, 2001. Google ScholarDigital Library
A. Vitorovic, M. ElSeidy, and C. Koch. Load balancing and skew resilience for parallel joins. In ICDE, 2016.Google ScholarCross Ref
A. Wilschut and P. Apers. Dataflow query execution in a aarallel main-memory environment. Distributed and Parallel Databases, 1(1):103--128, 1993. Google ScholarDigital Library
Y. Xing, J. Hwang, U. Cetintemel, and S. Zdonik. Providing resiliency to load variations in distributed stream processing. In VLDB, pages 775--786, 2006. Google ScholarDigital Library
Y. Xu, P. Kostamaa, X. Zhou, and L. Chen. Handling data skew in parallel joins in shared-nothing systems. In SIGMOD, pages 1043--1052, 2008. Google ScholarDigital Library
M. Zaharia, T. Das, and et al. Discretized streams: fault-tolerant streaming computation at scale. In SIGOPS, pages 423--438, 2013. Google ScholarDigital Library

Index Terms

Cost-Effective Stream Join Algorithm on Cloud System

Recommendations

Distributed stream join under workload variance

Flexible and self-adaptive stream join processing plays an important role in a parallel shared-nothing environments. Join-Matrix model is a high-performance model which is resilient to data skew and supports arbitrary join predicates for taking random ...
Read More
Two MRJs for Multi-way Theta-Join in MapReduce
IDCS 2013: Proceedings of the 6th International Conference on Internet and Distributed Computing Systems - Volume 8223

MapReduce is the most popular platform used in cloud computing for large-scale data processing. Generally, data processing involves multi-way Theta-joins join operations.Although multi-way Theta-joins could be processed in MapReduce by using a sequence ...
Read More
Distributed stream join query processing with semijoins

This paper addresses the distributed stream processing of window-based multi-way join queries considering the semijoin as a key join operator. In distributed stream processing, data streams arriving at remote sites need to be shipped to the processing ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
CIKM '16: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management
October 2016
2566 pages
ISBN:9781450340731
DOI:10.1145/2983323
General Chairs:
Snehasis Mukhopadhyay
Indiana University Purdue University Indianapolis, USA
,
ChengXiang Zhai
University of Illinois at Urbana-Champaign, USA
,
Program Chairs:
Elisa Bertino
Purdue University
,
Fabio Crestani
University of Lugano
,
Javed Mostafa
University of North Carolina
,
Jie Tang
Tsinghua University
,
Luo Si
Alibaba Group Inc & Purdue University
,
Xiaofang Zhou
University of Queensland
,
Yi Chang
Yahoo Research
,
Yunyao Li
IBM Research - Almaden
,
Parikshit Sondhi
WalmartLabs
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 24 October 2016
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
cost effective
dstributed stream join
matrix model
theta-join
Qualifiers
- research-article
Conference

Acceptance Rates
CIKM '16 Paper Acceptance Rate160of701submissions,23%Overall Acceptance Rate1,861of8,427submissions,22%
More
Upcoming Conference
CIKM '24

Sponsor:

sigir

sigir

The 33rd ACM International Conference on Information and Knowledge Management

October 21 - 25, 2024

Boise , ID , USA
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 4
  Total Citations
  View Citations
- 233
  Total Downloads
- Downloads (Last 12 months)12
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Cost-Effective Stream Join Algorithm on Cloud System

CIKM '16: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management

ABSTRACT

References

Cited By

Index Terms

Recommendations

Distributed stream join under workload variance

Two MRJs for Multi-way Theta-Join in MapReduce

Distributed stream join query processing with semijoins