Accelerating Divergent Applications on SIMD Architectures Using Neural Networks

Abstract
The purpose of this research is to find a neural-network-based solution to the well-known problem of branch divergence in Single Instruction Multiple Data (SIMD) architectures. Our approach differs from existing techniques for handling branch (or control-flow) divergence, which rely on costly hardware modifications, low-utilization masking techniques, or static prediction methods. Examining divergent applications, we characterize the degree of data-dependent control flow in each and isolate the code regions (or “kernels”) whose branch divergence causes the most performance degradation. We then train neural networks (NNs) offline to approximate these kernels and inject the NN computations directly into the applications as substitutes for the kernels they approximate. This essentially translates control flow into nondivergent computation, trading precision for performance. Because our methodology manipulates application source code directly, it is inherently platform agnostic and can serve as a general means of accelerating divergent applications on data-parallel architectures. In this article, we present the Neuralizer, an automated software flow for kernel identification, NN training, and NN integration, as well as supplementary user-controlled optimization techniques. Evaluating our approach on a variety of divergent applications run on a Graphics Processing Unit (GPU), we achieve average performance gains of 13.6× and energy savings of 14.8× with 96% accuracy.
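The substitution the abstract describes can be illustrated with a minimal sketch. This is not the Neuralizer flow itself: the divergent kernel is deliberately trivial, and the network weights are hand-picked rather than trained offline as in the paper. The point is only the structural transformation, where a data-dependent branch (which serializes SIMD lanes) becomes a fixed sequence of arithmetic operations that every lane executes identically.

```python
def divergent_kernel(x):
    """Per-element control flow: on a SIMD machine, lanes whose inputs
    take different sides of this branch must execute serially."""
    if x < 0.0:
        return -x
    else:
        return x

def nn_kernel(x):
    """Branch-free NN-style substitute: one hidden layer of two ReLU
    units computes relu(x) + relu(-x) == |x|. Weights are hand-chosen
    here for clarity; the paper's flow would learn them offline.
    Every lane runs the same instruction sequence regardless of x."""
    w1 = [1.0, -1.0]          # hidden-layer weights (hypothetical)
    w2 = [1.0, 1.0]           # output-layer weights (hypothetical)
    # max(v, 0.0) plays the role of ReLU; on SIMD hardware this maps
    # to a max instruction rather than a branch.
    h = [max(w * x, 0.0) for w in w1]
    return w2[0] * h[0] + w2[1] * h[1]

# Both kernels agree on this toy example; a trained NN would in general
# only approximate its kernel, trading precision for performance.
for x in [-2.5, -0.5, 0.0, 1.5]:
    assert nn_kernel(x) == divergent_kernel(x)
```

In the paper's setting the approximated kernels are far larger, and the NN forward pass is injected into the application source in place of the original code region, which is what makes the technique platform agnostic.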