research-article

Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU

Authors:

Jatin Chhugani,

Michael Deisher,

Anthony D. Nguyen,

Nadathur Satish,

Mikhail Smelyanskiy,

Srinivas Chennupaty,

Per Hammarlund,

Pradeep Dubey

ACM SIGARCH Computer Architecture News, Volume 38, Issue 3

Pages 451 - 460

https://doi.org/10.1145/1816038.1816021

Published: 19 June 2010

Abstract

Recent advances in computing have led to an explosion in the amount of data being generated. Processing the ever-growing data in a timely manner has made throughput computing an important aspect for emerging applications. Our analysis of a set of important throughput computing kernels shows that there is ample parallelism in these kernels, which makes them suitable for today's multi-core CPUs and GPUs. In the past few years there have been many studies claiming GPUs deliver substantial speedups (between 10X and 1000X) over multi-core CPUs on these kernels. To understand where such a large performance difference comes from, we perform a rigorous performance analysis and find that after applying optimizations appropriate for both CPUs and GPUs the performance gap between an Nvidia GTX280 processor and an Intel Core i7-960 processor narrows to only 2.5X on average. In this paper, we discuss optimization techniques for both CPU and GPU, analyze which architectural features contributed to performance differences between the two architectures, and recommend a set of architectural features which provide significant improvement in architectural efficiency for throughput kernels.

Published In

ACM SIGARCH Computer Architecture News Volume 38, Issue 3

ISCA '10

June 2010

508 pages

ISSN:0163-5964

DOI:10.1145/1816038

ISCA '10: Proceedings of the 37th annual international symposium on Computer architecture
June 2010
520 pages
ISBN:9781450300537
DOI:10.1145/1815961
General Chair: André Seznec (INRIA Rennes)
Program Chairs: Uri Weiser (Technion), Ronny Ronen (Intel)

Copyright © 2010 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 19 June 2010

Published in SIGARCH Volume 38, Issue 3

Qualifiers

Research-article

Bibliometrics & Citations

Article Metrics

Total Citations: 544
Total Downloads: 24,022
Downloads (Last 12 months): 562
Downloads (Last 6 weeks): 123

Reflects downloads up to 15 Feb 2025

Citations

Cited By

Bontes J, Gain J (2024). Redzone stream compaction: removing k items from a list in parallel O(k) time. ACM Transactions on Parallel Computing 11:3 (1-16). DOI: 10.1145/3675782. Online publication date: 29-Jun-2024.
https://dl.acm.org/doi/10.1145/3675782
Bilotta G (2024). From GPU to CPU (and Beyond): Extending Hardware Support in GPUSPH Through a SYCL-Inspired Interface. Concurrency and Computation: Practice and Experience 37:1. DOI: 10.1002/cpe.8313. Online publication date: 28-Oct-2024.
https://doi.org/10.1002/cpe.8313
Yang W, Zhang C, Pan M, Chandra S, Blincoe K, Tonella P (2023). Understanding the Topics and Challenges of GPU Programming by Classifying and Analyzing Stack Overflow Posts. Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (1444-1456). DOI: 10.1145/3611643.3616365. Online publication date: 30-Nov-2023.
https://dl.acm.org/doi/10.1145/3611643.3616365
Friesel B, Lütke Dreimann M, Spinczyk O (2023). A Full-System Perspective on UPMEM Performance. Proceedings of the 1st Workshop on Disruptive Memory Systems (1-7). DOI: 10.1145/3609308.3625266. Online publication date: 23-Oct-2023.
https://dl.acm.org/doi/10.1145/3609308.3625266
Hijma P, Heldens S, Sclocco A, van Werkhoven B, Bal H (2023). Optimization Techniques for GPU Programming. ACM Computing Surveys 55:11 (1-81). DOI: 10.1145/3570638. Online publication date: 16-Mar-2023.
https://dl.acm.org/doi/10.1145/3570638
Stergiou A, Poppe R (2023). AdaPool: Exponential Adaptive Pooling for Information-Retaining Downsampling. IEEE Transactions on Image Processing 32 (251-266). DOI: 10.1109/TIP.2022.3227503. Online publication date: 2023.
https://doi.org/10.1109/TIP.2022.3227503
Özcan A, Ayduman C, Türkoğlu E, Savaş E (2023). Homomorphic Encryption on GPU. IEEE Access 11 (84168-84186). DOI: 10.1109/ACCESS.2023.3265583. Online publication date: 2023.
https://doi.org/10.1109/ACCESS.2023.3265583
van Eerd J, Groote J, Hijma P, Martens J, Osama M, Wijs A (2023). Innermost many-sorted term rewriting on GPUs. Science of Computer Programming 225 (102910). DOI: 10.1016/j.scico.2022.102910. Online publication date: Jan-2023.
https://doi.org/10.1016/j.scico.2022.102910
Mustafa D (2022). Performance Evaluation of Massively Parallel Systems Using SPEC OMP Suite. Computers 11:5 (75). DOI: 10.3390/computers11050075. Online publication date: 5-May-2022.
https://doi.org/10.3390/computers11050075
Awile O, Kumbhar P, Cornu N, Dura-Bernal S, King J, Lupton O, Magkanaris I, McDougal R, Newton A, Pereira F, Săvulescu A, Carnevale N, Lytton W, Hines M, Schürmann F (2022). Modernizing the NEURON Simulator for Sustainability, Portability, and Performance. Frontiers in Neuroinformatics 16. DOI: 10.3389/fninf.2022.884046. Online publication date: 27-Jun-2022.
https://doi.org/10.3389/fninf.2022.884046
