research-article

Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort

Authors:
Nadathur Satish

Intel Corporation, Santa Clara, CA, USA

Intel Corporation, Santa Clara, CA, USA
View Profile

,
Changkyu Kim

Intel Corporation, Santa Clara, CA, USA

Intel Corporation, Santa Clara, CA, USA
View Profile

,
Jatin Chhugani

Intel Corporation, Santa Clara, CA, USA

Intel Corporation, Santa Clara, CA, USA
View Profile

,
Anthony D. Nguyen

Intel Corporation, Santa Clara, CA, USA

Intel Corporation, Santa Clara, CA, USA
View Profile

,
Victor W. Lee

Intel Corporation, Santa Clara, CA, USA

Intel Corporation, Santa Clara, CA, USA
View Profile

,
Daehyun Kim

Intel Corporation, Santa Clara, CA, USA

Intel Corporation, Santa Clara, CA, USA
View Profile

,
Pradeep Dubey

Intel Corporation, Santa Clara, CA, USA

Intel Corporation, Santa Clara, CA, USA
View Profile

SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of dataJune 2010Pages 351–362https://doi.org/10.1145/1807167.1807207

Published:06 June 2010Publication History

SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data

Pages 351–362

ABSTRACT

Sort is a fundamental kernel used in many database operations. In-memory sorts are now feasible; sort performance is limited by compute flops and main memory bandwidth rather than I/O. In this paper, we present a competitive analysis of comparison and non-comparison based sorting algorithms on two modern architectures - the latest CPU and GPU architectures. We propose novel CPU radix sort and GPU merge sort implementations which are 2X faster than previously published results. We perform a fair comparison of the algorithms using these best performing implementations on both architectures. While radix sort is faster on current architectures, the gap narrows from CPU to GPU architectures. Merge sort performs better than radix sort for sorting keys of large sizes - such keys will be required to accommodate the increasing cardinality of future databases. We present analytical models for analyzing the performance of our implementations in terms of architectural features such as core count, SIMD and bandwidth. Our obtained performance results are successfully predicted by our models. Our analysis points to merge sort winning over radix sort on future architectures due to its efficient utilization of SIMD and low bandwidth utilization. We simulate a 64-core platform with varying SIMD widths under constant bandwidth per core constraints, and show that large data sizes of 2⁴⁰ (one trillion records), merge sort performance on large key sizes is up to 3X better than radix sort for large SIMD widths on future architectures. Therefore, merge sort should be the sorting method of choice for future databases.

References

CUDPP: CUDA Data Parallel Primitives Library. gpgpu.org/developer/cudpp/.Google Scholar
Intel Performance Primitives. http://software.intel.com/en-us/intel-ipp/.Google Scholar
V. H. Allan, R. B. Jones, R. M. Lee, and S. J. Allan. Software pipelining. ACM Comput. Surv., 27(3):367--432, 1995. Google ScholarDigital Library
K. E. Batcher. Sorting networks and their applications. In Spring Joint Computer Conference, pages 307--314, 1968. Google ScholarDigital Library
C. Binnig, S. Hildenbrand, and F. Färber. Dictionary-based order-preserving string compression for column stores. In SIGMOD, pages 283--296, 2009. Google ScholarDigital Library
G. E. Blelloch. Vector models for data-parallel computing. MIT Press, Cambridge, MA, USA, 1990. Google ScholarDigital Library
P. Bohannon, P. Mcllroy, and R. Rastogi. Main-memory index structures with fixed-size partial keys. In SIGMOD, pages 163--174, 2001. Google ScholarDigital Library
S. Chaudhry, R. Cypher, M. Ekman, M. Karlsson, et al. Rock: A High-Performance Sparc CMT Processor. IEEE Micro, 29(2):6--16, 2009. Google ScholarDigital Library
J. Chhugani, A. D. Nguyen, V. W. Lee, et al. Efficient implementation of sorting on multi-core SIMD CPU architectures. VLDB, 1(2):1313--1324, 2008. Google ScholarDigital Library
T. Cormen, C. Leiserson, and R. Rivest. Intro. to Algorithms. MIT Press, 1990.Google Scholar
R. S. Francis, I. D. Mathieson, and L. Pannan. A fast, simple algorithm to balance a parallel multiway merge. In Proceedings of PARLE, 1993. Google ScholarDigital Library
N. Govindaraju, J. Gray, R. Kumar, et al. GPUTeraSort: High Performance Graphics Co-processor Sorting. In SIGMOD, pages 325--336, 2006. Google ScholarDigital Library
H. Inoue, T. Moriyama, H. Komatsu, et al. AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors. In PACT, pages 189--198, 2007. Google ScholarDigital Library
Intel Advanced Vector Extensions Programming Reference. 2008, http://softwarecommunity.intel.com/isn/downloads/intelavx/Intel-AVXProgramming-Reference-31943302.pdf.Google Scholar
D. Jiménex-González, J. J. Navarro, and J.-L. Larriba-Pey. CC-Radix:a Cache Conscious Sorting Based on Radix sort. Euromicro Conference on Parallel, Distributed, and Network-Based Processing, 0:101, 2003.Google Scholar
C. Kim, E. Sedlar, J. Chhugani, T. Kaldewey, et al. Sort vs. hash revisited: Fast join implementation on multi-core cpus. PVLDB, 2(2):1378--1389, 2009. Google ScholarDigital Library
A. Lamarca and R. E. Ladner. The Influence of Caches on the Performance of Sorting. In Journal of Algorithms, pages 370--379, 1997. Google ScholarDigital Library
N. Leischner, V. Osipov, and P. Sanders. Gpu sample sort, 2009.Google Scholar
NVIDIA. Fermi Architecture White Paper, 2009.Google Scholar
NVIDIA. NVIDIA CUDA Programming Guide 2.3. 2009.Google Scholar
M. Reilly. When multicore isn't enough: Trends and the future for multi-multicore systems. In HPEC, 2008.Google Scholar
N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1--10, 2009. Google ScholarDigital Library
L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, et al. Larrabee: A Many-Core x86 Architecture for Visual Computing. SIGGRAPH, 27(3), 2008. Google ScholarDigital Library
S. Sengupta, M. Harris, Y. Zhang, and J. D. Owens. Scan Primitives for GPU Computing. In Graphics Hardware 2007, pages 97--106, Aug. 2007. Google ScholarDigital Library
E. Sintorn and U. Assarsson. Fast Parallel GPU-Sorting Using a Hybrid Algorithm. In Workshop on GPGPU, 2007.Google Scholar
K. Thearling and S. Smith. An improved supercomputer sorting benchmark. In Proceedings of Supercomputing '92, pages 14--19, 1992. Google ScholarDigital Library
M. Zagha and G. E. Blelloch. Radix sort for vector multiprocessors. In Proceedings of Supercomputing '91. Google ScholarDigital Library

Index Terms

Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort
1. Information systems
  1. Data management systems
    1. Database management system engines

Recommendations

Design and Implementation of the Linpack Benchmark for Single and Multi-node Systems Based on Intel® Xeon Phi Coprocessor
IPDPS '13: Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing

Dense linear algebra has been traditionally used to evaluate the performance and efficiency of new architectures. This trend has continued for the past half decade with the advent of multi-core processors and hardware accelerators. In this paper we ...
Read More
A comparison-free sorting algorithm on CPUs and GPUs

This paper presents a new sorting algorithm that sorts input data elements without any comparison operations between the data--comparison-free sorting. Our algorithm's time complexity is on the order of O(N) for both single- and multi-threaded CPU and ...
Read More
Fast in-place sorting with CUDA based on bitonic sort
PPAM'09: Proceedings of the 8th international conference on Parallel processing and applied mathematics: Part I

State of the art graphics processors provide high processing power and furthermore, the high programmability of GPUs offered by frameworks like CUDA increases their usability as high-performance coprocessors for general-purpose computing. Sorting is ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
June 2010
1286 pages
ISBN:9781450300322
DOI:10.1145/1807167
General Chair:
Ahmed Elmagarmid
Purdue University, USA
,
Program Chair:
Divyakant Agrawal
University of California at Santa Barbara, USA
Copyright © 2010 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 6 June 2010
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
buffer
databases
many-core
merge
merge network
performance
radix
simd
sorting
tlp
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate785of4,003submissions,20%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 166
  Total Citations
  View Citations
- 3,768
  Total Downloads
- Downloads (Last 12 months)125
- Downloads (Last 6 weeks)26
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort

SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data

ABSTRACT

References

Cited By

Index Terms

Recommendations

Design and Implementation of the Linpack Benchmark for Single and Multi-node Systems Based on Intel® Xeon Phi Coprocessor

A comparison-free sorting algorithm on CPUs and GPUs

Fast in-place sorting with CUDA based on bitonic sort