Abstract
Network communication is the slowest component of many operators in distributed parallel databases deployed for large-scale analytics. Whereas considerable work has focused on speeding up databases on modern hardware, communication reduction has received less attention. Existing parallel DBMSs rely on algorithms designed for disks with minor modifications for networks. A more complicated algorithm may burden the CPUs but could avoid redundant transfers of tuples across the network. We introduce track join, a new distributed join algorithm that minimizes network traffic by generating an optimal transfer schedule for each distinct join key. Track join extends the trade-off options between CPU and network. Track join explicitly detects and exploits locality, also allowing for advanced placement of tuples beyond hash partitioning on a single attribute. We propose a novel data placement algorithm based on track join that minimizes the total network cost of multiple joins across different dimensions in an analytical workload. Our evaluation shows that track join outperforms hash join on the most expensive queries of real workloads regarding both network traffic and execution time. Finally, we show that our data placement optimization approach is both robust and effective in minimizing the total network cost of joins in analytical workloads.
- Foto N. Afrati and Jeffrey D. Ullman. 2010. Optimizing joins in a map-reduce environment. In Proceedings of the EDBT. 99--110. Google ScholarDigital Library
- Martina-Cezara Albutiu, Alfons Kemper, and Thomas Neumann. 2012. Massively parallel sort-merge joins in main memory multi-core database systems. PVLDB 5, 10 (June 2012), 1064--1075. Google ScholarDigital Library
- Cagri Balkesen, Gustavo Alonso, Jens Teubner, and M. Tamer Ozsu. 2013. MultiCore, main-memory joins: Sort vs. hash revisited. PVLDB 7, 1 (Sept. 2013), 85--96. Google ScholarDigital Library
- Philip A. Bernstein and D. W. Chiu. 1981. Using semi-joins to solve relational queries. J. ACM 28, 1 (Jan. 1981). Google ScholarDigital Library
- Philip A. Bernstein, Nathan Goodman, Eugene Wong, Christopher L. Reeve, and James B. Rothnie. 1981. Query processing in a system for distributed databases. ACM Trans. Database Syst. 6, 4 (Dec. 1981), 602--625. Google ScholarDigital Library
- Burton H. Bloom. 1970. Space/time trade-offs in hash coding with allowable errors. Comm. ACM 13, 7 (July 1970). Google ScholarDigital Library
- Surajit Chaudhuri and Vivek Narasayya. 2016. TPC-H Data Generation with Skew. Retrieved from https://www.microsoft.com/en-us/download/details.aspx?id=52430.Google Scholar
- Shumo Chu, Magdalena Balazinska, and Dan Suciu. 2015. From theory to practice: Efficient join query evaluation in a parallel database system. In Proceedings of the SIGMOD. 63--78. Google ScholarDigital Library
- Transaction Processing Performance Council. 2014. The TPC-H Benchmark, 2.17.1. Retrieved from http://www.tpc.org/tpch.Google Scholar
- Carlo Curino, Evan Jones, Yang Zhang, and Sam Madden. 2010. Schism: A workload-driven approach to database replication and partitioning. PVLDB 3, 1--2 (Sept. 2010), 48--57. Google ScholarDigital Library
- Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: Simplified data processing on large clusters. In Proceedings of the OSDI. 137--150. Google ScholarDigital Library
- David J. DeWitt, Shahram Ghandeharizadeh, Donovan A. Schneider, Allan Bricker, Hui-I Hsiao, and Rick Rasmussen. 1990. The gamma database machine project. IEEE Trans. Knowl. Data Engin. 2, 1 (Mar. 1990), 44--62. Google ScholarDigital Library
- David J. DeWitt and Jim Gray. 1992. Parallel database systems: The future of high performance database systems. Comm. ACM 35 (1992), 85--98. Google ScholarDigital Library
- David J. DeWitt, Randy H. Katz, Frank Olken, Leonard D. Shapiro, Michael R. Stonebraker, and David A. Wood. 1984. Implementation techniques for main memory database systems. In Proceedings of the SIGMOD. 1--8. Google ScholarDigital Library
- Jennie Duggan, Olga Papaemmanouil, Leilani Battle, and Michael Stonebraker. 2015. Skew-aware join optimization for array databases. In Proceedings of the SIGMOD. 123--135. Google ScholarDigital Library
- George Eadon and others. 2008. Supporting table partitioning by reference in oracle. In Proceedings of the SIGMOD. 1111--1122. Google ScholarDigital Library
- Mohammed Elseidy, Abdallah Elguindy, Aleksandar Vitorovic, and Christoph Koch. 2014. Scalable and adaptive online joins. PVLDB 7, 6 (Feb. 2014), 441--452. Google ScholarDigital Library
- Mohamed Y. Eltabakh, Yuanyuan Tian, Fatma Özcan, Rainer Gemulla, Aljoscha Krettek, and John McPherson. 2011. CoHadoop: Flexible data placement and its exploitation in Hadoop. PVLDB 4, 9 (June 2011), 575--585. Google ScholarDigital Library
- Robert Epstein, Michael Stonebraker, and Eugene Wong. 1978. Distributed query processing in a relational data base system. In Proceedings of the SIGMOD. 169--180. Google ScholarDigital Library
- Philip W. Frey, Romulo Goncalves, Martin Kersten, and Jens Teubner. 2009. Spinning relations: High-speed networks for distributed join processing. In Proceedings of the DaMoN. 27--33. Google ScholarDigital Library
- Lukasz Golab, Marios Hadjieleftheriou, Howard Karloff, and Barna Saha. 2014. Distributed data placement to minimize communication costs via graph partitioning. In Proceedings of the SSDBM. Article 20. Google ScholarDigital Library
- Martin Grund and others. 2010. HYRISE: A main memory hybrid storage engine. PVLDB 4, 2 (Nov. 2010), 105--116. Google ScholarDigital Library
- Ryan Johnson, Ippokratis Pandis, Nikos Hardavellas, Anastasia Ailamaki, and Babak Falsafi. 2009. Shore-MT: A scalable storage manager for the multicore era. In Proceedings of the EDBT. 24--35. Google ScholarDigital Library
- Changkyu Kim and others. 2009. Sort vs. hash revisited: Fast join implementation on modern multi-core CPUs. PVLDB 2, 2 (Aug. 2009), 1378--1389. Google ScholarDigital Library
- Changkyu Kim, Jongsoo Park, Nadathur Satish, Hongrae Lee, Pradeep Dubey, and Jatin Chhugani. 2012. CloudRAMsort: Fast and efficient large-scale distributed RAM sort on shared-nothing cluster. In Proceedings of the SIGMOD. 841--850. Google ScholarDigital Library
- Masaru Kitsuregawa, Hidehiko Tanaka, and Tohru Moto-Oka. 1983. Application of hash to data base machine and its architecture. New Gen. Comput. 1 (1983), 63--74.Google ScholarCross Ref
- K. Ashwin Kumar, Amol Deshpande, and Samir Khuller. 2013. Data placement and replica selection for improving co-location in distributed environments. CoRR abs/1302.4168 (2013). Retrieved from http://arxiv.org/abs/1302.4168.Google Scholar
- Viktor Leis et al. 2014. Morsel-driven parallelism: A NUMA-aware query evaluation framework for the many-core age. In Proceedings of the SIGMOD. 743--754. Google ScholarDigital Library
- Daniel Lemire et al. 2015. Decoding billions of integers per second through vectorization. Softw.: Pract. Exper. 45, 1 (Jan. 2015), 1--29. Google ScholarDigital Library
- Zhe Li and Kenneth A. Ross. 1995. PERF join: An alternative to two-way semijoin and Bloomjoin. In Proceedings of the CIKM. 137--144. Google ScholarDigital Library
- Lothar F. Mackert and Guy M. Lohman. 1986. R<sup>*</sup> optimizer validation and performance evaluation for distributed queries. In Proceedings of the VLDB. 149--159. Google ScholarDigital Library
- Stefan Manegold, Peter A. Boncz, and Martin L. Kersten. 2000. Optimizing database architecture for the new bottleneck: Memory access. VLDB J. 9, 3 (2000), 231--246. Google ScholarDigital Library
- Manish Mehta and David J. DeWitt. 1997. Data placement in shared-nothing parallel database systems. VLDB J. 6, 1 (Feb. 1997). Google ScholarDigital Library
- Ingo Müller et al. 2015. Cache-efficient aggregation: Hashing is sorting. In Proceedings of the SIGMOD. 1123--1136. Google ScholarDigital Library
- J. K. Mullin. 1990. Optimal semijoins for distributed database systems. IEEE Trans. Softw. Eng. 16, 5 (May 1990). Google ScholarDigital Library
- Rimma Nehme and Nicolas Bruno. 2011. Automated partitioning design in parallel database systems. In Proceedings of the SIGMOD. Google ScholarDigital Library
- Alper Okcan and Mirek Riedewald. 2011. Processing theta-joins using mapreduce. In Proceedings of the SIGMOD. 949--960. Google ScholarDigital Library
- Andrew Pavlo, Carlo Curino, and Stanley Zdonik. 2012. Skew-aware automatic database partitioning in shared-nothing, parallel OLTP systems. In Proceedings of the SIGMOD. 61--72. Google ScholarDigital Library
- Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, and Michael Stonebraker. 2009. A comparison of approaches to large-scale data analysis. In Proceedings of the SIGMOD. 165--178. Google ScholarDigital Library
- Holger Pirk, Oscar Moll, Matei Zaharia, and Sam Madden. 2016. Voodoo—A vector algebra for portable database performance on modern hardware. PVLDB 9, 14 (Oct. 2016), 1707--1718. Google ScholarDigital Library
- Orestis Polychroniou, Arun Raghavan, and Kenneth A. Ross. 2015. Rethinking SIMD vectorization for in-memory databases. In Proceedings of the SIGMOD. 1493--1508. Google ScholarDigital Library
- Orestis Polychroniou and Kenneth A. Ross. 2014. A comprehensive study of main-memory partitioning and its application to large-scale comparison- and radix-sort. In Proceedings of the SIGMOD. 755--766. Google ScholarDigital Library
- Orestis Polychroniou and Kenneth A. Ross. 2015. Efficient lightweight compression alongside fast scans. In Proceedings of the DaMoN. Google ScholarDigital Library
- Orestis Polychroniou, Rajkumar Sen, and Kenneth A. Ross. 2014. Track join: Distributed joins with minimal network traffic. In Proceedings of the SIGMOD. ACM, 1483--1494. Google ScholarDigital Library
- Abdul Quamar, K. Ashwin Kumar, and Amol Deshpande. 2013. SWORD: Scalable workload-aware data placement for transactional workloads. In Proceedings of the EDBT. 430--441. Google ScholarDigital Library
- Vijayshankar Raman et al. 2013. DB2 with BLU acceleration: So much more than just a column store. PVLDB 6, 11 (Aug. 2013), 1080--1091. Google ScholarDigital Library
- Jun Rao, Chun Zhang, Nimrod Megiddo, and Guy Lohman. 2002. Automating physical database design in a parallel database. In Proceedings of the SIGMOD. 558--569. Google ScholarDigital Library
- Wolf Rödiger, Sam Idicula, Alfons Kemper, and Thomas Neumann. 2016. Flow-join: Adaptive skew handling for distributed joins over high-speed networks. In Proceedings of the ICDE. 1194--1205.Google ScholarCross Ref
- Wolf Rödiger, Tobias Mühlbauer, Philipp Unterbrunner, Angelika Reiser, Alfons Kemper, and Thomas Neumann. 2014. Locality-sensitive operators for parallel main-memory database clusters. In Proceedings of the ICDE. 17--30.Google ScholarCross Ref
- N. Roussopoulos and H. Kang. 1991. A pipeline N-way join algorithm based on the 2-way semijoin program. IEEE Trans. Knowl. Data Engin. 3, 4 (Dec. 1991), 486--495. Google ScholarDigital Library
- Nadathur Satish, Changkyu Kim, Jatin Chhugani, Anthony D. Nguyen, Victor W. Lee, Daehyun Kim, and Pradeep Dubey. 2010. Fast sort on CPUs and GPUs: A case for bandwidth oblivious SIMD sort. In Proceedings of the SIGMOD. 351--362. Google ScholarDigital Library
- Donovan A. Schneider and David J. DeWitt. 1989. A performance evaluation of four parallel join algorithms in a shared-nothing multiprocessor environment. In Proceedings of the SIGMOD. 110--121. Google ScholarDigital Library
- James W. Stamos and Honesty C. Young. 1993. A symmetric fragment and replicate algorithm for distributed joins. IEEE Trans. Parallel Distrib. Syst. 4, 12 (Dec. 1993), 1345--1354. Google ScholarDigital Library
- Thomas Stöhr et al. 2000. Multi-dimensional database allocation for parallel data warehouses. In Proceedings of the VLDB. 273--284. Google ScholarDigital Library
- Mike Stonebraker et al. 2005. C-store: A column-oriented DBMS. In Proceedings of the VLDB. 553--564. Google ScholarDigital Library
- Michael Stonebraker et al. 2007. The end of an architectural era: (It’s time for a complete rewrite). In Proceedings of the VLDB. 1150--1160. Google ScholarDigital Library
- Rebecca Taft et al. 2014. E-store: Fine-grained elastic partitioning for distributed transaction processing systems. PVLDB 8, 3 (Nov. 2014), 245--256. Google ScholarDigital Library
- Yuanyuan Tian, Fatma Özcan, Tao Zou, Romulo Goncalves, and Hamid Pirahesh. 2016. Building a hybrid warehouse: Efficient joins between data stored in HDFS and enterprise warehouse. ACM TODS 41, 4 (2016), 21. Google ScholarDigital Library
- Tolga Urhan and Michael J. Franklin. 2000. XJoin: A reactively scheduled pipelined join operator. IEEE Data Engin. Bulletin 23 (June 2000), 27--33.Google Scholar
- Jan Wassenberg and Peter Sanders. 2011. Engineering a multi core radix sort. In Proceedings of the EuroPar. 160--169. Google ScholarDigital Library
- Thomas Willhalm, Nicolae Popovici, Yazan Boshmaf, Hasso Plattner, Alexander Zeier, and Jan Schaffner. 2009. SIMD-scan: Ultra fast in-memory table scan using on-chip vector processing units. PVLDB 2, 1 (Aug. 2009), 385--394. Google ScholarDigital Library
- Reynold S. Xin, Josh Rosen, Matei Zaharia, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2013. Shark: SQL and rich analytics at scale. In Proceedings of the SIGMOD. 13--24. Google ScholarDigital Library
- Yang Ye, Kenneth A. Ross, and Norases Vesdapunt. 2011. Scalable aggregation on multicore processors. In Proceedings of the DaMoN. Google ScholarDigital Library
- Feng Yu, Wen-Chi Hou, Cheng Luo, Dunren Che, and Mengxia Zhu. 2013. CS2: A new database synopsis for query estimation. In Proceedings of the SIGMOD. 469--480. Google ScholarDigital Library
- Erfan Zamanian, Carsten Binnig, and Abdallah Salama. 2015. Locality-aware partitioning in parallel database systems. In Proceedings of the SIGMOD. 17--30. Google ScholarDigital Library
Index Terms
- Distributed Joins and Data Placement for Minimal Network Traffic
Recommendations
Distributed GPU Joins on Fast RDMA-capable Networks
PACMMODIn this paper, we present a novel pipelined GPU join that accelerates the performance of distributed DBMSs by leveraging GPU resources on fast networks. A key insight is that we enable pipelined join execution by overlapping the network shuffling with ...
PoBery: Possibly-complete Big Data Queries with Probabilistic Data Placement and Scanning
In big data query processing, there is a trade-off between query accuracy and query efficiency, for example, sampling query approaches trade-off query completeness for efficiency. In this article, we argue that query performance can be significantly ...
Near-Optimal Distributed Band-Joins through Recursive Partitioning
SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of DataWe consider running-time optimization for band-joins in a distributed system, e.g., the cloud. To balance load across worker machines, input has to be partitioned, which causes duplication. We explore how to resolve this tension between maximum load per ...
Comments