ACM Home Page
Please provide us with feedback. Feedback
Handling data skew in parallel joins in shared-nothing systems
Full text PdfPdf (1.41 MB)
Source
International Conference on Management of Data archive
Proceedings of the 2008 ACM SIGMOD international conference on Management of data table of contents
Vancouver, Canada
SESSION: Industrial Session 1: Query Optimization and Performance table of contents
Pages 1043-1052  
Year of Publication: 2008
ISBN:978-1-60558-102-6
Authors
Yu Xu  Teradata, San Diego, CA, USA
Pekka Kostamaa  Teradata, San Diego, CA, USA
Xin Zhou  Teradata, San Diego, CA, USA
Liang Chen  UCSD, San Diego, CA, USA
Sponsors
ACM: Association for Computing Machinery
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 44,   Downloads (12 Months): 97,   Citation Count: 0
Additional Information:

abstract   references   index terms   collaborative colleagues  

Tools and Actions: Review this Article  
Save this Article to a Binder    Display Formats: BibTex  EndNote ACM Ref   
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1376616.1376720
What is a DOI?

ABSTRACT

Parallel processing continues to be important in large data warehouses. The processing requirements continue to expand in multiple dimensions. These include greater volumes, increasing number of concurrent users, more complex queries, and more applications which define complex logical, semantic, and physical data models. Shared nothing parallel database management systems [16] can scale up "horizontally" by adding more nodes. Most parallel algorithms, however, do not take into account data skew. Data skew occurs naturally in many applications. A query processing skewed data not only slows down its response time, but generates hot nodes, which become a bottleneck throttling the overall system performance. Motivated by real business problems, we propose a new join geography called PRPD (Partial Redistribution & Partial Duplication) to improve the performance and scalability of parallel joins in the presence of data skew in a shared-nothing system. Our experimental results show that PRPD significantly speeds up query elapsed time in the presence of data skew. Our experience shows that eliminating system bottlenecks caused by data skew improves the throughput of the whole system which is important in parallel data warehouses that often run high concurrency workloads.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
TPC Benchmark H (decision support) standard specification http://www.tpc.org.
 
2
K. Alsabti and S. Ranka. Skew-insensitive parallel algorithms for relational join. In HIPC, page 367, 1998.
 
3
M. Bamha and G. Hains. Frequency-adaptive join for shared nothing machines. Progress in computer research, pages 227--241, 2001.
 
4
J. L. Carter and M. N. Wegman. Universal classes of hash functions. Journal of Computer and System Sciences, 18:143--154, 1979.
 
5
H.M. Dewan, M. A. Hernández, K. W. Mok, and S. J. Stolfo. Predictive dynamic load balancing of parallel hash-joins over heterogeneous processors in the presence of data skew. In PDIS, pages 40--49, 1994.
 
6
D. DeWitt and J. Gray. Parallel database systems: the future of high performance database systems. Commun. ACM, 35(6):85--98, 1992.
 
7
D. J. DeWitt, J. F. Naughton, D. A. Schneider, and S. Seshadri. Practical skew handling in parallel joins. In VLDB, 1992.
 
8
FrankǎOlken and DoronǎRotem. Random sampling from databases: a survey. Statistics and Computing, 5(1):25--42, 1995.
 
9
R. L. Graham. Bounds on multiprocessing timing anomalies. SIAM Journal on Applied Mathematics, 17(2):416--429, 1969.
 
10
L. Harada and M. Kitsuregawa. Dynamic join product skew handling for hash-joins in shared-nothing database systems. In DASFAA, pages 246--255, 1995.
 
11
K. A. Hua and C. Lee. Handling data skew in multiprocessor database computers using partition tuning. In VLDB, pages 525--535, 1991.
 
12
E. G. C. Jr., M. R. Garey, and D. S. Johnson. An application of bin-packing to multiprocessor scheduling. SIAM J. Comput., 7(1):1--17, 1978.
 
13
M. Kitsuregawa and Y. Ogawa. Bucket spreading parallel hash: A new, robust, parallel hash join method for data skew in the super database computer (sdc). In VLDB, pages 210--221, 1990.
 
14
M. S. Lakshmi and P. S. Yu. Effectiveness of parallel joins. IEEE Transactions on Knowledge and Data Engineering, 2(4):410--424, 1990.
 
15
A. Shatdal and J. F. Naughton. Using shared virtual memory for parallel join processing. In SIGMOD Conference, pages 119--128, 1993.
 
16
M. Stonebraker. The case for shared nothing. IEEE Database Eng. Bull., 9(1):4--9, 1986.
 
17
C. B. Walton, A. G. Dale, and R. M. Jenevein. A taxonomy and performance model of data skew effects in parallel joins. In VLDB, pages 537--548, 1991.
 
18
J. L.Wolf, D. M. Dias, and P. S. Yu. A parallel sort merge join algorithm for managing data skew. IEEE Trans. Parallel Distrib. Syst., 4(1):70--86, 1993.
 
19
J. L. Wolf, D. M. Dias, P. S. Yu, and J. Turek. An effective algorithm for parallelizing hash joins in the presence of data skew. In ICDE, pages 200--209, 1991.
 
20
J. L. Wolf, D. M. Dias, P. S. Yu, and J. Turek. New algorithms for parallelizing relational database joins in the presence of data skew. IEEE Trans. Knowl. Data Eng., 6(6):990--997, 1994.
 
21
X. Zhou and M. E. Orlowska. Handling data skew in parallel hash join computation using two-phase scheduling. In IEEE 1st International Conference on Algorithm and Architecture for Parallel Processing, pages 527--536 vol.2, 1995.

Collaborative Colleagues:
Yu Xu: colleagues
Pekka Kostamaa: colleagues
Xin Zhou: colleagues
Liang Chen: colleagues