Can traditional programming bridge the ninja performance gap for parallel computing applications?

Authors:
Nadathur Satish, Parallel Computing Lab, Intel Corp.
Changkyu Kim, Google Inc.
Jatin Chhugani, Ebay Inc.
Hideki Saito, Intel Compiler Lab, Intel Corp.
Rakesh Krishnaiyer, Intel Compiler Lab, Intel Corp.
Mikhail Smelyanskiy, Parallel Computing Lab, Intel Corp.
Milind Girkar, Intel Compiler Lab, Intel Corp.
Pradeep Dubey, Parallel Computing Lab, Intel Corp.

Communications of the ACM, Volume 58, Issue 5, May 2015, pp 77–86
https://doi.org/10.1145/2742910

Published: 23 April 2015

References

Arora, N., Shringarpure, A., Vuduc, R.W. Direct N-body Kernels for multicore platforms. In ICPP (2009), 379--387.
Asanovic, K., Bodik, R., Catanzaro, B., Gebis, J., Husbands, P., Keutzer, K., Patterson, D.A., Plishker, W.L., Shalf, J., et al. The Landscape of Parallel Computing Research: A View from Berkeley. Technical Report UCB/EECS-183, 2006.
Bienia, C., Kumar, S., Singh, J.P., Li, K. The PARSEC benchmark suite: Characterization and architectural implications. In PACT (2008), 72--81.
Brace, A., Gatarek, D., Musiela, M. The market model of interest rate dynamics. Mathematical Finance 7, 2 (1997), 127--155.
Chen, Y.K., Chhugani, J., et al. Convergence of recognition, mining and synthesis workloads and its implications. IEEE 96, 5 (2008), 790--807.
Chhugani, J., Nguyen, A.D., et al. Efficient implementation of sorting on multi-core SIMD CPU architecture. PVLDB 1, 2 (2008), 1313--1324.
Dally, W.J. The end of denial architecture and the rise of throughput computing. In Keynote Speech at Design Automation Conference (2010).
Datta, K. Auto-tuning Stencil Codes for Cache-based Multicore Platforms. PhD thesis, EECS Department, University of California, Berkeley (Dec 2009).
Fowler, M. Domain Specific Languages, 1st edn. Addison-Wesley Professional, Boston, MA, 2010.
Giles, M.B. Monte Carlo Evaluation of Sensitivities in Computational Finance. Technical report. Oxford University Computing Laboratory, 2007.
Intel. A quick, easy and reliable way to improve threaded performance, 2010. software.intel.com/articles/intel-cilk-plus.
Ismail, L., Guerchi, D. Performance evaluation of convolution on the cell broadband engine processor. IEEE PDS 22, 2 (2011), 337--351.
Kachelrieß, M., Knaup, M., Bockenbach, O. Hyperfast perspective cone-beam backprojection. IEEE Nuclear Science 3 (2006), 1679--1683.
Kim, C., Chhugani, J., Satish, N., et al. FAST: fast architecture sensitive tree search on modern CPUs and GPUs. In SIGMOD (2010), 339--350.
Lee, V.W., Kim, C., Chhugani, J., Deisher, M., Kim, D., Nguyen, A.D., Satish, N., et al. Debunking the 100X GPU vs. CPU myth: An evaluation of throughput computing on CPU and GPU. In ISCA (2010), 451--460.
Mudge, T.N. Power: A first-class architectural design constraint. IEEE Computer 34, 4 (2001), 52--58.
Nguyen, A., Satish, N., et al. 3.5-D blocking optimization for stencil computations on modern CPUs and GPUs. In SC10 (2010), 1--13.
Nuzman, D., Henderson, R. Multi-platform auto-vectorization. In CGO (2006), 281--294.
Nvidia. CUDA C Best Practices Guide 3, 2 (2010).
Podlozhnyuk, V. Black--Scholes option pricing. Nvidia, 2007. http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/BlackScholes/doc/BlackScholes.pdf.
Ryoo, S., Rodrigues, C.I., Baghsorkhi, S.S., Stone, S.S., Kirk, D.B., Hwu, W.M.W. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In PPoPP (2008), 73--82.
Satish, N., Kim, C., Chhugani, J., et al. Fast sort on CPUs and GPUs: A case for bandwidth oblivious SIMD sort. In SIGMOD (2010), 351--362.
Satish, N., Kim, C., Chhugani, J., Saito, H., Krishnaiyer, R., Smelyanskiy, M., et al. Can traditional programming bridge the Ninja performance gap for parallel computing applications? In ISCA (2012), 440--451.
Smelyanskiy, M., Holmes, D., et al. Mapping high-fidelity volume rendering to CPU, GPU and many-core. IEEE TVCG 15, 6 (2009), 1563--1570.
Sukop, M.C., Thorne, D.T., Jr. Lattice Boltzmann Modeling: An Introduction for Geoscientists and Engineers, 2006.
Tian, X., Saito, H., Girkar, M., Preis, S., Kozhukhov, S., Cherkasov, A.G., Nelson, C., Panchenko, N., Geva, R. Compiling C/C++ SIMD extensions for function and loop vectorization on multicore-SIMD processors. In IPDPS Workshops (Springer, NY, 2012), 2349--2358.

Index Terms

1. Information systems
  1. Data management systems
    1. Database management system engines
2. Software and its engineering
  1. Software notations and tools
    1. Compilers

Recommendations

Can traditional programming bridge the Ninja performance gap for parallel computing applications?
ISCA '12: Proceedings of the 39th Annual International Symposium on Computer Architecture

Current processor trends of integrating more cores with wider SIMD units, along with a deeper and complex memory hierarchy, have made it increasingly more challenging to extract performance from applications. It is believed by some that traditional ...

Parallel Programming for Modern High Performance Computing Systems


Published in
Communications of the ACM Volume 58, Issue 5
May 2015
80 pages
ISSN: 0001-0782
EISSN: 1557-7317
DOI: 10.1145/2766485
Editor: Moshe Y. Vardi
Association for Computing Machinery, New York, NY
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Qualifiers
- research-article
- Research
- Refereed

Article Metrics
- Total Citations: 9
- Total Downloads: 5,991
- Downloads (Last 12 months): 156
- Downloads (Last 6 weeks): 9
