research-article

Public Access

Distributed Joins and Data Placement for Minimal Network Traffic

Authors:
Orestis Polychroniou

Columbia University, USA

Columbia University, USA
View Profile

,
Wangda Zhang

Columbia University, USA

Columbia University, USA
View Profile

,
Kenneth A. Ross

Columbia University, USA

Columbia University, USA
View Profile

Authors Info & Claims

ACM Transactions on Database Systems Volume 43 Issue 3Article No.: 14pp 1–45https://doi.org/10.1145/3241039

Published:16 November 2018Publication History

ACM Transactions on Database Systems

Abstract

Network communication is the slowest component of many operators in distributed parallel databases deployed for large-scale analytics. Whereas considerable work has focused on speeding up databases on modern hardware, communication reduction has received less attention. Existing parallel DBMSs rely on algorithms designed for disks with minor modifications for networks. A more complicated algorithm may burden the CPUs but could avoid redundant transfers of tuples across the network. We introduce track join, a new distributed join algorithm that minimizes network traffic by generating an optimal transfer schedule for each distinct join key. Track join extends the trade-off options between CPU and network. Track join explicitly detects and exploits locality, also allowing for advanced placement of tuples beyond hash partitioning on a single attribute. We propose a novel data placement algorithm based on track join that minimizes the total network cost of multiple joins across different dimensions in an analytical workload. Our evaluation shows that track join outperforms hash join on the most expensive queries of real workloads regarding both network traffic and execution time. Finally, we show that our data placement optimization approach is both robust and effective in minimizing the total network cost of joins in analytical workloads.

References

Foto N. Afrati and Jeffrey D. Ullman. 2010. Optimizing joins in a map-reduce environment. In Proceedings of the EDBT. 99--110. Google ScholarDigital Library
Martina-Cezara Albutiu, Alfons Kemper, and Thomas Neumann. 2012. Massively parallel sort-merge joins in main memory multi-core database systems. PVLDB 5, 10 (June 2012), 1064--1075. Google ScholarDigital Library
Cagri Balkesen, Gustavo Alonso, Jens Teubner, and M. Tamer Ozsu. 2013. MultiCore, main-memory joins: Sort vs. hash revisited. PVLDB 7, 1 (Sept. 2013), 85--96. Google ScholarDigital Library
Philip A. Bernstein and D. W. Chiu. 1981. Using semi-joins to solve relational queries. J. ACM 28, 1 (Jan. 1981). Google ScholarDigital Library
Philip A. Bernstein, Nathan Goodman, Eugene Wong, Christopher L. Reeve, and James B. Rothnie. 1981. Query processing in a system for distributed databases. ACM Trans. Database Syst. 6, 4 (Dec. 1981), 602--625. Google ScholarDigital Library
Burton H. Bloom. 1970. Space/time trade-offs in hash coding with allowable errors. Comm. ACM 13, 7 (July 1970). Google ScholarDigital Library
Surajit Chaudhuri and Vivek Narasayya. 2016. TPC-H Data Generation with Skew. Retrieved from https://www.microsoft.com/en-us/download/details.aspx?id=52430.Google Scholar
Shumo Chu, Magdalena Balazinska, and Dan Suciu. 2015. From theory to practice: Efficient join query evaluation in a parallel database system. In Proceedings of the SIGMOD. 63--78. Google ScholarDigital Library
Transaction Processing Performance Council. 2014. The TPC-H Benchmark, 2.17.1. Retrieved from http://www.tpc.org/tpch.Google Scholar
Carlo Curino, Evan Jones, Yang Zhang, and Sam Madden. 2010. Schism: A workload-driven approach to database replication and partitioning. PVLDB 3, 1--2 (Sept. 2010), 48--57. Google ScholarDigital Library
Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: Simplified data processing on large clusters. In Proceedings of the OSDI. 137--150. Google ScholarDigital Library
David J. DeWitt, Shahram Ghandeharizadeh, Donovan A. Schneider, Allan Bricker, Hui-I Hsiao, and Rick Rasmussen. 1990. The gamma database machine project. IEEE Trans. Knowl. Data Engin. 2, 1 (Mar. 1990), 44--62. Google ScholarDigital Library
David J. DeWitt and Jim Gray. 1992. Parallel database systems: The future of high performance database systems. Comm. ACM 35 (1992), 85--98. Google ScholarDigital Library
David J. DeWitt, Randy H. Katz, Frank Olken, Leonard D. Shapiro, Michael R. Stonebraker, and David A. Wood. 1984. Implementation techniques for main memory database systems. In Proceedings of the SIGMOD. 1--8. Google ScholarDigital Library
Jennie Duggan, Olga Papaemmanouil, Leilani Battle, and Michael Stonebraker. 2015. Skew-aware join optimization for array databases. In Proceedings of the SIGMOD. 123--135. Google ScholarDigital Library
George Eadon and others. 2008. Supporting table partitioning by reference in oracle. In Proceedings of the SIGMOD. 1111--1122. Google ScholarDigital Library
Mohammed Elseidy, Abdallah Elguindy, Aleksandar Vitorovic, and Christoph Koch. 2014. Scalable and adaptive online joins. PVLDB 7, 6 (Feb. 2014), 441--452. Google ScholarDigital Library
Mohamed Y. Eltabakh, Yuanyuan Tian, Fatma Özcan, Rainer Gemulla, Aljoscha Krettek, and John McPherson. 2011. CoHadoop: Flexible data placement and its exploitation in Hadoop. PVLDB 4, 9 (June 2011), 575--585. Google ScholarDigital Library
Robert Epstein, Michael Stonebraker, and Eugene Wong. 1978. Distributed query processing in a relational data base system. In Proceedings of the SIGMOD. 169--180. Google ScholarDigital Library
Philip W. Frey, Romulo Goncalves, Martin Kersten, and Jens Teubner. 2009. Spinning relations: High-speed networks for distributed join processing. In Proceedings of the DaMoN. 27--33. Google ScholarDigital Library
Lukasz Golab, Marios Hadjieleftheriou, Howard Karloff, and Barna Saha. 2014. Distributed data placement to minimize communication costs via graph partitioning. In Proceedings of the SSDBM. Article 20. Google ScholarDigital Library
Martin Grund and others. 2010. HYRISE: A main memory hybrid storage engine. PVLDB 4, 2 (Nov. 2010), 105--116. Google ScholarDigital Library
Ryan Johnson, Ippokratis Pandis, Nikos Hardavellas, Anastasia Ailamaki, and Babak Falsafi. 2009. Shore-MT: A scalable storage manager for the multicore era. In Proceedings of the EDBT. 24--35. Google ScholarDigital Library
Changkyu Kim and others. 2009. Sort vs. hash revisited: Fast join implementation on modern multi-core CPUs. PVLDB 2, 2 (Aug. 2009), 1378--1389. Google ScholarDigital Library
Changkyu Kim, Jongsoo Park, Nadathur Satish, Hongrae Lee, Pradeep Dubey, and Jatin Chhugani. 2012. CloudRAMsort: Fast and efficient large-scale distributed RAM sort on shared-nothing cluster. In Proceedings of the SIGMOD. 841--850. Google ScholarDigital Library
Masaru Kitsuregawa, Hidehiko Tanaka, and Tohru Moto-Oka. 1983. Application of hash to data base machine and its architecture. New Gen. Comput. 1 (1983), 63--74.Google ScholarCross Ref
K. Ashwin Kumar, Amol Deshpande, and Samir Khuller. 2013. Data placement and replica selection for improving co-location in distributed environments. CoRR abs/1302.4168 (2013). Retrieved from http://arxiv.org/abs/1302.4168.Google Scholar
Viktor Leis et al. 2014. Morsel-driven parallelism: A NUMA-aware query evaluation framework for the many-core age. In Proceedings of the SIGMOD. 743--754. Google ScholarDigital Library
Daniel Lemire et al. 2015. Decoding billions of integers per second through vectorization. Softw.: Pract. Exper. 45, 1 (Jan. 2015), 1--29. Google ScholarDigital Library
Zhe Li and Kenneth A. Ross. 1995. PERF join: An alternative to two-way semijoin and Bloomjoin. In Proceedings of the CIKM. 137--144. Google ScholarDigital Library
Lothar F. Mackert and Guy M. Lohman. 1986. R<sup>*</sup> optimizer validation and performance evaluation for distributed queries. In Proceedings of the VLDB. 149--159. Google ScholarDigital Library
Stefan Manegold, Peter A. Boncz, and Martin L. Kersten. 2000. Optimizing database architecture for the new bottleneck: Memory access. VLDB J. 9, 3 (2000), 231--246. Google ScholarDigital Library
Manish Mehta and David J. DeWitt. 1997. Data placement in shared-nothing parallel database systems. VLDB J. 6, 1 (Feb. 1997). Google ScholarDigital Library
Ingo Müller et al. 2015. Cache-efficient aggregation: Hashing is sorting. In Proceedings of the SIGMOD. 1123--1136. Google ScholarDigital Library
J. K. Mullin. 1990. Optimal semijoins for distributed database systems. IEEE Trans. Softw. Eng. 16, 5 (May 1990). Google ScholarDigital Library
Rimma Nehme and Nicolas Bruno. 2011. Automated partitioning design in parallel database systems. In Proceedings of the SIGMOD. Google ScholarDigital Library
Alper Okcan and Mirek Riedewald. 2011. Processing theta-joins using mapreduce. In Proceedings of the SIGMOD. 949--960. Google ScholarDigital Library
Andrew Pavlo, Carlo Curino, and Stanley Zdonik. 2012. Skew-aware automatic database partitioning in shared-nothing, parallel OLTP systems. In Proceedings of the SIGMOD. 61--72. Google ScholarDigital Library
Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, and Michael Stonebraker. 2009. A comparison of approaches to large-scale data analysis. In Proceedings of the SIGMOD. 165--178. Google ScholarDigital Library
Holger Pirk, Oscar Moll, Matei Zaharia, and Sam Madden. 2016. Voodoo—A vector algebra for portable database performance on modern hardware. PVLDB 9, 14 (Oct. 2016), 1707--1718. Google ScholarDigital Library
Orestis Polychroniou, Arun Raghavan, and Kenneth A. Ross. 2015. Rethinking SIMD vectorization for in-memory databases. In Proceedings of the SIGMOD. 1493--1508. Google ScholarDigital Library
Orestis Polychroniou and Kenneth A. Ross. 2014. A comprehensive study of main-memory partitioning and its application to large-scale comparison- and radix-sort. In Proceedings of the SIGMOD. 755--766. Google ScholarDigital Library
Orestis Polychroniou and Kenneth A. Ross. 2015. Efficient lightweight compression alongside fast scans. In Proceedings of the DaMoN. Google ScholarDigital Library
Orestis Polychroniou, Rajkumar Sen, and Kenneth A. Ross. 2014. Track join: Distributed joins with minimal network traffic. In Proceedings of the SIGMOD. ACM, 1483--1494. Google ScholarDigital Library
Abdul Quamar, K. Ashwin Kumar, and Amol Deshpande. 2013. SWORD: Scalable workload-aware data placement for transactional workloads. In Proceedings of the EDBT. 430--441. Google ScholarDigital Library
Vijayshankar Raman et al. 2013. DB2 with BLU acceleration: So much more than just a column store. PVLDB 6, 11 (Aug. 2013), 1080--1091. Google ScholarDigital Library
Jun Rao, Chun Zhang, Nimrod Megiddo, and Guy Lohman. 2002. Automating physical database design in a parallel database. In Proceedings of the SIGMOD. 558--569. Google ScholarDigital Library
Wolf Rödiger, Sam Idicula, Alfons Kemper, and Thomas Neumann. 2016. Flow-join: Adaptive skew handling for distributed joins over high-speed networks. In Proceedings of the ICDE. 1194--1205.Google ScholarCross Ref
Wolf Rödiger, Tobias Mühlbauer, Philipp Unterbrunner, Angelika Reiser, Alfons Kemper, and Thomas Neumann. 2014. Locality-sensitive operators for parallel main-memory database clusters. In Proceedings of the ICDE. 17--30.Google ScholarCross Ref
N. Roussopoulos and H. Kang. 1991. A pipeline N-way join algorithm based on the 2-way semijoin program. IEEE Trans. Knowl. Data Engin. 3, 4 (Dec. 1991), 486--495. Google ScholarDigital Library
Nadathur Satish, Changkyu Kim, Jatin Chhugani, Anthony D. Nguyen, Victor W. Lee, Daehyun Kim, and Pradeep Dubey. 2010. Fast sort on CPUs and GPUs: A case for bandwidth oblivious SIMD sort. In Proceedings of the SIGMOD. 351--362. Google ScholarDigital Library
Donovan A. Schneider and David J. DeWitt. 1989. A performance evaluation of four parallel join algorithms in a shared-nothing multiprocessor environment. In Proceedings of the SIGMOD. 110--121. Google ScholarDigital Library
James W. Stamos and Honesty C. Young. 1993. A symmetric fragment and replicate algorithm for distributed joins. IEEE Trans. Parallel Distrib. Syst. 4, 12 (Dec. 1993), 1345--1354. Google ScholarDigital Library
Thomas Stöhr et al. 2000. Multi-dimensional database allocation for parallel data warehouses. In Proceedings of the VLDB. 273--284. Google ScholarDigital Library
Mike Stonebraker et al. 2005. C-store: A column-oriented DBMS. In Proceedings of the VLDB. 553--564. Google ScholarDigital Library
Michael Stonebraker et al. 2007. The end of an architectural era: (It’s time for a complete rewrite). In Proceedings of the VLDB. 1150--1160. Google ScholarDigital Library
Rebecca Taft et al. 2014. E-store: Fine-grained elastic partitioning for distributed transaction processing systems. PVLDB 8, 3 (Nov. 2014), 245--256. Google ScholarDigital Library
Yuanyuan Tian, Fatma Özcan, Tao Zou, Romulo Goncalves, and Hamid Pirahesh. 2016. Building a hybrid warehouse: Efficient joins between data stored in HDFS and enterprise warehouse. ACM TODS 41, 4 (2016), 21. Google ScholarDigital Library
Tolga Urhan and Michael J. Franklin. 2000. XJoin: A reactively scheduled pipelined join operator. IEEE Data Engin. Bulletin 23 (June 2000), 27--33.Google Scholar
Jan Wassenberg and Peter Sanders. 2011. Engineering a multi core radix sort. In Proceedings of the EuroPar. 160--169. Google ScholarDigital Library
Thomas Willhalm, Nicolae Popovici, Yazan Boshmaf, Hasso Plattner, Alexander Zeier, and Jan Schaffner. 2009. SIMD-scan: Ultra fast in-memory table scan using on-chip vector processing units. PVLDB 2, 1 (Aug. 2009), 385--394. Google ScholarDigital Library
Reynold S. Xin, Josh Rosen, Matei Zaharia, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2013. Shark: SQL and rich analytics at scale. In Proceedings of the SIGMOD. 13--24. Google ScholarDigital Library
Yang Ye, Kenneth A. Ross, and Norases Vesdapunt. 2011. Scalable aggregation on multicore processors. In Proceedings of the DaMoN. Google ScholarDigital Library
Feng Yu, Wen-Chi Hou, Cheng Luo, Dunren Che, and Mengxia Zhu. 2013. CS2: A new database synopsis for query estimation. In Proceedings of the SIGMOD. 469--480. Google ScholarDigital Library
Erfan Zamanian, Carsten Binnig, and Abdallah Salama. 2015. Locality-aware partitioning in parallel database systems. In Proceedings of the SIGMOD. 17--30. Google ScholarDigital Library

Index Terms

Distributed Joins and Data Placement for Minimal Network Traffic
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
        Join algorithms
      2. Parallel and distributed DBMSs
        Relational parallel and distributed DBMSs
    2. Information integration
      1. Data warehouses

Recommendations

Distributed GPU Joins on Fast RDMA-capable Networks
PACMMOD

In this paper, we present a novel pipelined GPU join that accelerates the performance of distributed DBMSs by leveraging GPU resources on fast networks. A key insight is that we enable pipelined join execution by overlapping the network shuffling with ...
Read More
PoBery: Possibly-complete Big Data Queries with Probabilistic Data Placement and Scanning
In big data query processing, there is a trade-off between query accuracy and query efficiency, for example, sampling query approaches trade-off query completeness for efficiency. In this article, we argue that query performance can be significantly ...
Read More
Near-Optimal Distributed Band-Joins through Recursive Partitioning
SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data

We consider running-time optimization for band-joins in a distributed system, e.g., the cloud. To balance load across worker machines, input has to be partitioned, which causes duplication. We explore how to resolve this tension between maximum load per ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
ACM Transactions on Database Systems Volume 43, Issue 3
Best of PODS 2017, Best of ICDT 2017 and Regular Papers
September 2018
164 pages
ISSN:0362-5915
EISSN:1557-4644
DOI:10.1145/3284689
Editor:
Christian S. Jensen
Aalborg University, Denmark
Issue’s Table of Contents
Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 16 November 2018
- Accepted: 1 July 2018
- Revised: 1 June 2018
- Received: 1 March 2017
Published in tods Volume 43, Issue 3

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Distributed joins
data placement
network optimization
Qualifiers
- research-article
- Research
- Refereed
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 5
  Total Citations
  View Citations
- 886
  Total Downloads
- Downloads (Last 12 months)124
- Downloads (Last 6 weeks)17
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Distributed Joins and Data Placement for Minimal Network Traffic

ACM Transactions on Database Systems

Abstract

References

Cited By

Index Terms

Recommendations

Distributed GPU Joins on Fast RDMA-capable Networks

PoBery: Possibly-complete Big Data Queries with Probabilistic Data Placement and Scanning

Near-Optimal Distributed Band-Joins through Recursive Partitioning

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Distributed Joins and Data Placement for Minimal Network Traffic

ACM Transactions on Database Systems

Abstract

References

Cited By

Index Terms

Recommendations

Distributed GPU Joins on Fast RDMA-capable Networks

PoBery: Possibly-complete Big Data Queries with Probabilistic Data Placement and Scanning

Near-Optimal Distributed Band-Joins through Recursive Partitioning

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media