
Accelerating Divergent Applications on SIMD Architectures Using Neural Networks

Published: 09 March 2015

Abstract

The purpose of this research is to find a neural-network-based solution to the well-known problem of branch divergence in Single Instruction Multiple Data (SIMD) architectures. Our approach differs from existing techniques that handle branch (or control-flow) divergence, which use costly hardware modifications, low-utilization masking techniques, or static prediction methods. As we examine divergent applications, we characterize the degree of data-dependent control flow seen in each and isolate the code regions (or “kernels”) that cause the most performance degradation due to branch divergence. We then train neural networks (NNs) offline to approximate these kernels and inject the NN computations directly into the applications as substitutes for the kernels they approximate. This essentially translates control flow into nondivergent computation, trading off precision for performance. As our methodology manipulates application source code directly, it is inherently platform-agnostic and can be adopted as a general means for accelerating divergent applications on data-parallel architectures. In this article, we present the Neuralizer, an automated software flow for kernel identification, NN training, and NN integration, as well as supplementary user-controlled optimization techniques. Evaluating our approach on a variety of divergent applications run on a Graphics Processing Unit (GPU), we achieve, on average, performance gains of 13.6× and energy savings of 14.8× with 96% accuracy.
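As a purely illustrative sketch of the transformation described above (not the authors' actual Neuralizer output), the CUDA fragment below replaces a hypothetical data-dependent branch with a small multilayer perceptron that every thread evaluates with the same instruction sequence. The approximated region (divergent_region), the 2-8-1 topology, the sigmoid activation, and the weight arrays W1, b1, W2, b2 are assumptions made for this example; in the paper's flow the weights would come from offline training against the original kernel.

// Illustrative sketch only: hypothetical divergent region and hypothetical
// network; weights are assumed to come from offline training.
#include <cuda_runtime.h>
#include <math.h>

#define N_IN  2   // inputs to the approximated region
#define N_HID 8   // hidden sigmoid units (illustrative topology)

// Original, divergent form: threads in the same warp may take different paths.
__device__ float divergent_region(float a, float b)
{
    float r2 = a * a + b * b;
    if (r2 < 1.0f)
        return sqrtf(r2);
    else
        return logf(1.0f + r2);
}

// Neural-network substitute: a fixed 2-8-1 MLP evaluated by every thread with
// the same instruction sequence, so no SIMD lanes are masked off.
__device__ float nn_region(float a, float b,
                           const float* W1, const float* b1,  // [N_HID*N_IN], [N_HID]
                           const float* W2, float b2)         // [N_HID], scalar
{
    float in[N_IN] = { a, b };
    float out = b2;
    for (int h = 0; h < N_HID; ++h) {
        float acc = b1[h];
        for (int i = 0; i < N_IN; ++i)
            acc += W1[h * N_IN + i] * in[i];
        out += W2[h] * (1.0f / (1.0f + expf(-acc)));  // sigmoid activation
    }
    return out;  // approximate value: precision traded for uniform control flow
}

// Element-wise driver kernel showing how the substitute is dropped in.
__global__ void apply_nn(const float* a, const float* b, float* y, int n,
                         const float* W1, const float* b1,
                         const float* W2, float b2)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        y[idx] = nn_region(a[idx], b[idx], W1, b1, W2, b2);
}

Because nn_region contains no data-dependent branches, lanes remain converged when it is substituted for the original region; the quality of the result then depends entirely on how well the offline-trained network approximates that region's output.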



Published in

ACM Transactions on Architecture and Code Optimization, Volume 12, Issue 1
April 2015, 201 pages
ISSN: 1544-3566
EISSN: 1544-3973
DOI: 10.1145/2744295

Copyright © 2015 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 1 September 2014
• Revised: 1 December 2014
• Accepted: 1 January 2015
• Published: 9 March 2015
