ABSTRACT
Matrix-based scheme (Join-Matrix) can prefectly support distributed stream joins, especially for arbitrary join predicates, because it guarantees any tuples from two streams to meet with each other. However,the dynamics and unpredictability features of stream require quick actions on scheme changing. Otherwise, they may lead to degradation of system throughputs and increament of processing latency with the waste of system resources, such as CPUs and Memories. Since Join-Matrix model has the fixed processing architecture with replicated data, these kinds of adverseness will be magnified. Therefore, it is urgent to find a solution that preserves advantages of Join-Matrix model and promises a good usage to computation resources when it meets scheme changing. In this paper, we propose a cost-effective stream join algorithm, which ensures the adaptability of Join-Matrix but with lower resources consumption. Specifically, a varietal matrix generation algorithm is proposed to generate an irregular matrix scheme for assigning the minimal number of tasks; a lightweight migration algorithm is designed to ensure state migration at a low cost; a complete load balance process framework is described to guarantee the correctness during the scheme changing. We conduct extensive experiments to compare our method with baseline systems on both benchmarks and real-workloads, and explain the results in detail.
- Apache Storm. http://storm.apache.org/.Google Scholar
- The TPC-H Benchmark. http://www.tpc.org/tpch.Google Scholar
- R. Ananthanarayanan, V. Basker, S. Das, and et al. Photon: fault-tolerant and scalable joining of continuous data streams. In SIGMOD, pages 577--588, 2013. Google ScholarDigital Library
- G. Cormode and S. Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1):58--75, 2005. Google ScholarDigital Library
- J.-P. Dittrich, B. Seeger, and et al. Progressive merge join: A genetic and non-blocking sort-based join algorithm. In VLDB, pages 299--310, 2002. Google ScholarDigital Library
- M. Elseidy, A. Elguindy, A.and Vitorovic, and C. Koch. Scalable and adaptive online joins. In VLDB, pages 441--452, 2014. Google ScholarDigital Library
- R. S. Epstein, M. Stonebraker, and E. Wong. Distributed query processing in a relational data base system. In SIGMOD, pages 169--180, 1978. Google ScholarDigital Library
- J. Fang, R. Zhang, X. Wang, and Z. A. Flexible and adaptive stream join algorithm. In APWeb, Accepted, 2016.Google ScholarCross Ref
- B. Gedik. Partitioning functions for stateful data parallelism in stream processing. VLDBJ, 23(4):517--539, 2014. Google ScholarDigital Library
- R. Huebsch, M. Garofalakis, J. Hellerstein, and I. Stoica. Advanced join strategies for large-scale distributed computation. In VLDB, pages 1484--1495, 2014. Google ScholarDigital Library
- Z. Ives, D. Florescu, M. Friedman, A. Levy, and D. Weld. An adaptive query execution system for data integration. In SIGMOD, pages 299--310, 1999. Google ScholarDigital Library
- Y. Kwon, M. Balazinska, and et al. Skewtune: mitigating skew in mapreduce applications. In SIGMOD, pages 25--36, 2012. Google ScholarDigital Library
- Q. Lin, B. C. Ooi, Z. Wang, and C. Yu. Scalable distributed stream join processing. In SIGMOD, pages 811--825, 2015. Google ScholarDigital Library
- B. Liu, Y. Zhu, M. Jbantova, and et al. A dynamically adaptive distributed system for processing complex continuous queries. In VLDB, pages 1338--1341, 2005. Google ScholarDigital Library
- M. Mokbel, M. Lu, and W. Aref. Hash-merge join: Non-blocking join algorithm for producing fast and early join results. In ICDE, pages 251--263, 2004. Google ScholarDigital Library
- M. A. U. Nasir, M. Serafini, and et al. When two choices are not enough: Balancing at scale in distributed stream processing. In ICDE, 2016.Google ScholarCross Ref
- A. Okcan and M. Riedewald. Processing theta-joins using mapreduce. In SIGMOD, pages 949--960, 2011. Google ScholarDigital Library
- Z. Qian, Y. He, C. Su, and et al. Timestream: reliable stream computation in the cloud. In Eurosys, pages 1--14, 2013. Google ScholarDigital Library
- J. W. Stamos and H. C. Young. A symmetric and replicate algorithm for distributed joins. IEEE Transactions on Parallel and Distributed Systems, 4(12):1345--1354, 1993. Google ScholarDigital Library
- N. ufler, B.and Augsten, A. Reiser, and A. Kemper. Load balancing in mapreduce based on scalable cardinality estimates. In ICDE, pages 522--533, 2012. Google ScholarDigital Library
- T. Urhan and M. Franklin. Dynamic pipeline scheduling for improving interactive query performance. In VLDB, pages 501--510, 2001. Google ScholarDigital Library
- A. Vitorovic, M. ElSeidy, and C. Koch. Load balancing and skew resilience for parallel joins. In ICDE, 2016.Google ScholarCross Ref
- A. Wilschut and P. Apers. Dataflow query execution in a aarallel main-memory environment. Distributed and Parallel Databases, 1(1):103--128, 1993. Google ScholarDigital Library
- Y. Xing, J. Hwang, U. Cetintemel, and S. Zdonik. Providing resiliency to load variations in distributed stream processing. In VLDB, pages 775--786, 2006. Google ScholarDigital Library
- Y. Xu, P. Kostamaa, X. Zhou, and L. Chen. Handling data skew in parallel joins in shared-nothing systems. In SIGMOD, pages 1043--1052, 2008. Google ScholarDigital Library
- M. Zaharia, T. Das, and et al. Discretized streams: fault-tolerant streaming computation at scale. In SIGOPS, pages 423--438, 2013. Google ScholarDigital Library
Index Terms
- Cost-Effective Stream Join Algorithm on Cloud System
Recommendations
Distributed stream join under workload variance
Flexible and self-adaptive stream join processing plays an important role in a parallel shared-nothing environments. Join-Matrix model is a high-performance model which is resilient to data skew and supports arbitrary join predicates for taking random ...
Two MRJs for Multi-way Theta-Join in MapReduce
IDCS 2013: Proceedings of the 6th International Conference on Internet and Distributed Computing Systems - Volume 8223MapReduce is the most popular platform used in cloud computing for large-scale data processing. Generally, data processing involves multi-way Theta-joins join operations.Although multi-way Theta-joins could be processed in MapReduce by using a sequence ...
Distributed stream join query processing with semijoins
This paper addresses the distributed stream processing of window-based multi-way join queries considering the semijoin as a key join operator. In distributed stream processing, data streams arriving at remote sites need to be shipped to the processing ...
Comments