skip to main content
10.1145/1807167.1807207acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort

Published:06 June 2010Publication History

ABSTRACT

Sort is a fundamental kernel used in many database operations. In-memory sorts are now feasible; sort performance is limited by compute flops and main memory bandwidth rather than I/O. In this paper, we present a competitive analysis of comparison and non-comparison based sorting algorithms on two modern architectures - the latest CPU and GPU architectures. We propose novel CPU radix sort and GPU merge sort implementations which are 2X faster than previously published results. We perform a fair comparison of the algorithms using these best performing implementations on both architectures. While radix sort is faster on current architectures, the gap narrows from CPU to GPU architectures. Merge sort performs better than radix sort for sorting keys of large sizes - such keys will be required to accommodate the increasing cardinality of future databases. We present analytical models for analyzing the performance of our implementations in terms of architectural features such as core count, SIMD and bandwidth. Our obtained performance results are successfully predicted by our models. Our analysis points to merge sort winning over radix sort on future architectures due to its efficient utilization of SIMD and low bandwidth utilization. We simulate a 64-core platform with varying SIMD widths under constant bandwidth per core constraints, and show that large data sizes of 240 (one trillion records), merge sort performance on large key sizes is up to 3X better than radix sort for large SIMD widths on future architectures. Therefore, merge sort should be the sorting method of choice for future databases.

References

  1. CUDPP: CUDA Data Parallel Primitives Library. gpgpu.org/developer/cudpp/.Google ScholarGoogle Scholar
  2. Intel Performance Primitives. http://software.intel.com/en-us/intel-ipp/.Google ScholarGoogle Scholar
  3. V. H. Allan, R. B. Jones, R. M. Lee, and S. J. Allan. Software pipelining. ACM Comput. Surv., 27(3):367--432, 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. K. E. Batcher. Sorting networks and their applications. In Spring Joint Computer Conference, pages 307--314, 1968. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. C. Binnig, S. Hildenbrand, and F. Färber. Dictionary-based order-preserving string compression for column stores. In SIGMOD, pages 283--296, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. G. E. Blelloch. Vector models for data-parallel computing. MIT Press, Cambridge, MA, USA, 1990. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. P. Bohannon, P. Mcllroy, and R. Rastogi. Main-memory index structures with fixed-size partial keys. In SIGMOD, pages 163--174, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. S. Chaudhry, R. Cypher, M. Ekman, M. Karlsson, et al. Rock: A High-Performance Sparc CMT Processor. IEEE Micro, 29(2):6--16, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. J. Chhugani, A. D. Nguyen, V. W. Lee, et al. Efficient implementation of sorting on multi-core SIMD CPU architectures. VLDB, 1(2):1313--1324, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. T. Cormen, C. Leiserson, and R. Rivest. Intro. to Algorithms. MIT Press, 1990.Google ScholarGoogle Scholar
  11. R. S. Francis, I. D. Mathieson, and L. Pannan. A fast, simple algorithm to balance a parallel multiway merge. In Proceedings of PARLE, 1993. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. N. Govindaraju, J. Gray, R. Kumar, et al. GPUTeraSort: High Performance Graphics Co-processor Sorting. In SIGMOD, pages 325--336, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. H. Inoue, T. Moriyama, H. Komatsu, et al. AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors. In PACT, pages 189--198, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Intel Advanced Vector Extensions Programming Reference. 2008, http://softwarecommunity.intel.com/isn/downloads/intelavx/Intel-AVXProgramming-Reference-31943302.pdf.Google ScholarGoogle Scholar
  15. D. Jiménex-González, J. J. Navarro, and J.-L. Larriba-Pey. CC-Radix:a Cache Conscious Sorting Based on Radix sort. Euromicro Conference on Parallel, Distributed, and Network-Based Processing, 0:101, 2003.Google ScholarGoogle Scholar
  16. C. Kim, E. Sedlar, J. Chhugani, T. Kaldewey, et al. Sort vs. hash revisited: Fast join implementation on multi-core cpus. PVLDB, 2(2):1378--1389, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. A. Lamarca and R. E. Ladner. The Influence of Caches on the Performance of Sorting. In Journal of Algorithms, pages 370--379, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. N. Leischner, V. Osipov, and P. Sanders. Gpu sample sort, 2009.Google ScholarGoogle Scholar
  19. NVIDIA. Fermi Architecture White Paper, 2009.Google ScholarGoogle Scholar
  20. NVIDIA. NVIDIA CUDA Programming Guide 2.3. 2009.Google ScholarGoogle Scholar
  21. M. Reilly. When multicore isn't enough: Trends and the future for multi-multicore systems. In HPEC, 2008.Google ScholarGoogle Scholar
  22. N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IPDPS, pages 1--10, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, et al. Larrabee: A Many-Core x86 Architecture for Visual Computing. SIGGRAPH, 27(3), 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. S. Sengupta, M. Harris, Y. Zhang, and J. D. Owens. Scan Primitives for GPU Computing. In Graphics Hardware 2007, pages 97--106, Aug. 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. E. Sintorn and U. Assarsson. Fast Parallel GPU-Sorting Using a Hybrid Algorithm. In Workshop on GPGPU, 2007.Google ScholarGoogle Scholar
  26. K. Thearling and S. Smith. An improved supercomputer sorting benchmark. In Proceedings of Supercomputing '92, pages 14--19, 1992. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. M. Zagha and G. E. Blelloch. Radix sort for vector multiprocessors. In Proceedings of Supercomputing '91. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
      June 2010
      1286 pages
      ISBN:9781450300322
      DOI:10.1145/1807167

      Copyright © 2010 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 6 June 2010

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article

      Acceptance Rates

      Overall Acceptance Rate785of4,003submissions,20%

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader