- Arora, N., Shringarpure, A., Vuduc, R.W. Direct N-body kernels for multicore platforms. In ICPP (2009), 379--387.
- Asanovic, K., Bodik, R., Catanzaro, B., Gebis, J., Husbands, P., Keutzer, K., Patterson, D.A., Plishker, W.L., Shalf, J., et al. The Landscape of Parallel Computing Research: A View from Berkeley. Technical Report UCB/EECS-183, 2006.
- Bienia, C., Kumar, S., Singh, J.P., Li, K. The PARSEC benchmark suite: Characterization and architectural implications. In PACT (2008), 72--81.
- Brace, A., Gatarek, D., Musiela, M. The market model of interest rate dynamics. Mathematical Finance 7, 2 (1997), 127--155.
- Chen, Y.K., Chhugani, J., et al. Convergence of recognition, mining and synthesis workloads and its implications. IEEE 96, 5 (2008), 790--807.
- Chhugani, J., Nguyen, A.D., et al. Efficient implementation of sorting on multi-core SIMD CPU architecture. PVLDB 1, 2 (2008), 1313--1324.
- Dally, W.J. The end of denial architecture and the rise of throughput computing. In Keynote Speech at Design Automation Conference (2010).
- Datta, K. Auto-tuning Stencil Codes for Cache-based Multicore Platforms. PhD thesis, EECS Department, University of California, Berkeley (Dec 2009).
- Fowler, M. Domain Specific Languages, 1st edn. Addison-Wesley Professional, Boston, MA, 2010.
- Giles, M.B. Monte Carlo Evaluation of Sensitivities in Computational Finance. Technical report. Oxford University Computing Laboratory, 2007.
- Intel. A quick, easy and reliable way to improve threaded performance, 2010. software.intel.com/articles/intel-cilk-plus.
- Ismail, L., Guerchi, D. Performance evaluation of convolution on the Cell Broadband Engine processor. IEEE PDS 22, 2 (2011), 337--351.
- Kachelrieß, M., Knaup, M., Bockenbach, O. Hyperfast perspective cone-beam backprojection. IEEE Nuclear Science 3 (2006), 1679--1683.
- Kim, C., Chhugani, J., Satish, N., et al. FAST: Fast architecture sensitive tree search on modern CPUs and GPUs. In SIGMOD (2010), 339--350.
- Lee, V.W., Kim, C., Chhugani, J., Deisher, M., Kim, D., Nguyen, A.D., Satish, N., et al. Debunking the 100X GPU vs. CPU myth: An evaluation of throughput computing on CPU and GPU. In ISCA (2010), 451--460.
- Mudge, T.N. Power: A first-class architectural design constraint. IEEE Computer 34, 4 (2001), 52--58.
- Nguyen, A., Satish, N., et al. 3.5-D blocking optimization for stencil computations on modern CPUs and GPUs. In SC10 (2010), 1--13.
- Nuzman, D., Henderson, R. Multi-platform auto-vectorization. In CGO (2006), 281--294.
- Nvidia. CUDA C Best Practices Guide 3, 2 (2010).
- Podlozhnyuk, V. Black--Scholes option pricing. Nvidia, 2007. http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/BlackScholes/doc/BlackScholes.pdf.
- Ryoo, S., Rodrigues, C.I., Baghsorkhi, S.S., Stone, S.S., Kirk, D.B., Hwu, W.M.W. Optimization principles and application performance evaluation of a multithreaded GPU using CUDA. In PPoPP (2008), 73--82.
- Satish, N., Kim, C., Chhugani, J., et al. Fast sort on CPUs and GPUs: A case for bandwidth oblivious SIMD sort. In SIGMOD (2010), 351--362.
- Satish, N., Kim, C., Chhugani, J., Saito, H., Krishnaiyer, R., Smelyanskiy, M., et al. Can traditional programming bridge the Ninja performance gap for parallel computing applications? In ISCA (2012), 440--451.
- Smelyanskiy, M., Holmes, D., et al. Mapping high-fidelity volume rendering to CPU, GPU and many-core. IEEE TVCG 15, 6 (2009), 1563--1570.
- Sukop, M.C., Thorne, D.T., Jr. Lattice Boltzmann Modeling: An Introduction for Geoscientists and Engineers, 2006.
- Tian, X., Saito, H., Girkar, M., Preis, S., Kozhukhov, S., Cherkasov, A.G., Nelson, C., Panchenko, N., Geva, R. Compiling C/C++ SIMD extensions for function and loop vectorization on multicore-SIMD processors. In IPDPS Workshops (Springer, NY, 2012), 2349--2358.