Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU

Published: 19 June 2010

Abstract

Recent advances in computing have led to an explosion in the amount of data being generated. Processing this ever-growing data in a timely manner has made throughput computing an important aspect of emerging applications. Our analysis of a set of important throughput computing kernels shows that these kernels contain ample parallelism, which makes them suitable for today's multi-core CPUs and GPUs. In the past few years, many studies have claimed that GPUs deliver substantial speedups (between 10X and 1000X) over multi-core CPUs on these kernels. To understand where such a large performance difference comes from, we perform a rigorous performance analysis and find that, after applying optimizations appropriate for both CPUs and GPUs, the performance gap between an Nvidia GTX280 processor and the Intel Core i7-960 processor narrows to only 2.5X on average. In this paper, we discuss optimization techniques for both CPU and GPU, analyze which architectural features contributed to the performance differences between the two architectures, and recommend a set of architectural features that provide significant improvement in architectural efficiency for throughput kernels.
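
To make the notion of "ample parallelism" concrete, the sketch below shows one simple data-parallel kernel whose output elements can each be computed independently, so the index range can be split across workers. SAXPY is used purely as an illustration; the abstract does not name specific kernels, and the function names `saxpy_chunk` and `parallel_saxpy` and the `workers` parameter are hypothetical. The code only demonstrates the decomposition: CPython threads are not expected to speed up this arithmetic, and nothing here reproduces the paper's measurements.

```python
from concurrent.futures import ThreadPoolExecutor


def saxpy_chunk(a, x, y, start, stop):
    """Compute a*x[i] + y[i] for i in [start, stop); each element is independent."""
    return [a * x[i] + y[i] for i in range(start, stop)]


def parallel_saxpy(a, x, y, workers=4):
    """Split the index range into contiguous chunks and process them concurrently."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    if n == 0:
        return []
    workers = max(1, min(workers, n))
    step = (n + workers - 1) // workers  # ceiling division: chunk size per worker
    bounds = [(s, min(s + step, n)) for s in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves chunk order, so concatenating the parts restores index order
        parts = pool.map(lambda b: saxpy_chunk(a, x, y, *b), bounds)
        return [v for part in parts for v in part]
```

Because no chunk reads another chunk's output, the same partitioning idea maps onto CPU cores, SIMD lanes, or GPU threads; how well it performs on each is the question the paper studies.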

Published In

ACM SIGARCH Computer Architecture News, Volume 38, Issue 3
ISCA '10
June 2010
508 pages
ISSN: 0163-5964
DOI: 10.1145/1816038

ISCA '10: Proceedings of the 37th annual international symposium on Computer architecture
June 2010
520 pages
ISBN: 9781450300537
DOI: 10.1145/1815961

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 June 2010
Published in SIGARCH Volume 38, Issue 3

Author Tags

  1. cpu architecture
  2. gpu architecture
  3. performance analysis
  4. performance measurement
  5. software optimization
  6. throughput computing

Qualifiers

  • Research-article

Article Metrics

  • Downloads (last 12 months): 562
  • Downloads (last 6 weeks): 123

Reflects downloads up to 15 Feb 2025.

Cited By

  • (2024) Redzone stream compaction: removing k items from a list in parallel O(k) time. ACM Transactions on Parallel Computing 11:3 (1-16). DOI: 10.1145/3675782. Online publication date: 29-Jun-2024
  • (2024) From GPU to CPU (and Beyond): Extending Hardware Support in GPUSPH Through a SYCL-Inspired Interface. Concurrency and Computation: Practice and Experience 37:1. DOI: 10.1002/cpe.8313. Online publication date: 28-Oct-2024
  • (2023) Understanding the Topics and Challenges of GPU Programming by Classifying and Analyzing Stack Overflow Posts. Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (1444-1456). DOI: 10.1145/3611643.3616365. Online publication date: 30-Nov-2023
  • (2023) A Full-System Perspective on UPMEM Performance. Proceedings of the 1st Workshop on Disruptive Memory Systems (1-7). DOI: 10.1145/3609308.3625266. Online publication date: 23-Oct-2023
  • (2023) Optimization Techniques for GPU Programming. ACM Computing Surveys 55:11 (1-81). DOI: 10.1145/3570638. Online publication date: 16-Mar-2023
  • (2023) AdaPool: Exponential Adaptive Pooling for Information-Retaining Downsampling. IEEE Transactions on Image Processing 32 (251-266). DOI: 10.1109/TIP.2022.3227503. Online publication date: 2023
  • (2023) Homomorphic Encryption on GPU. IEEE Access 11 (84168-84186). DOI: 10.1109/ACCESS.2023.3265583. Online publication date: 2023
  • (2023) Innermost many-sorted term rewriting on GPUs. Science of Computer Programming 225 (102910). DOI: 10.1016/j.scico.2022.102910. Online publication date: Jan-2023
  • (2022) Performance Evaluation of Massively Parallel Systems Using SPEC OMP Suite. Computers 11:5 (75). DOI: 10.3390/computers11050075. Online publication date: 5-May-2022
  • (2022) Modernizing the NEURON Simulator for Sustainability, Portability, and Performance. Frontiers in Neuroinformatics 16. DOI: 10.3389/fninf.2022.884046. Online publication date: 27-Jun-2022
