Research Article (Open Access)

The Next 700 Accelerated Layers: From Mathematical Expressions of Network Computation Graphs to Accelerated GPU Kernels, Automatically

Published: 11 October 2019

Abstract

Deep learning frameworks automate the deployment, distribution, synchronization, memory allocation, and hardware acceleration of models represented as graphs of computational operators. These operators wrap high-performance libraries such as cuDNN or NNPACK. When the computation does not match any predefined library call, custom operators must be implemented, often at high engineering cost and performance penalty, limiting the pace of innovation. To address this productivity gap, we propose and evaluate: (1) a domain-specific language with a tensor notation close to the mathematics of deep learning; (2) a Just-In-Time optimizing compiler based on the polyhedral framework; (3) carefully coordinated linear optimization and evolutionary algorithms to synthesize high-performance CUDA kernels; (4) the transparent integration of our flow into PyTorch and Caffe2, providing fully automatic synthesis of high-performance GPU kernels from simple tensor algebra. Performance is comparable to, and often exceeds, that of highly tuned libraries.
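
To make the first contribution concrete, the sketch below defines a matrix multiplication in the index notation of the open-source Tensor Comprehensions project this article builds on, and invokes it from PyTorch. The `tensor_comprehensions` Python API shown (`tc.define` and the commented autotuner call) is an assumption based on early releases of that project and may differ in detail; it is an illustration, not the paper's verbatim interface.

    # Illustrative sketch only: the tc.define/autotune API follows early
    # Tensor Comprehensions releases and is an assumption, not a verbatim
    # reproduction of the paper's interface.
    import torch
    import tensor_comprehensions as tc

    # Index notation close to the mathematics: C(m, n) accumulates
    # A(m, k) * B(k, n) over the reduction index k; "+=!" zero-initializes
    # the accumulator before reducing.
    lang = """
    def matmul(float(M, K) A, float(K, N) B) -> (C) {
        C(m, n) +=! A(m, k) * B(k, n)
    }
    """

    matmul = tc.define(lang, name="matmul")
    A = torch.randn(128, 256).cuda()
    B = torch.randn(256, 512).cuda()
    C = matmul(A, B)  # JIT-compiles a CUDA kernel on first call, then runs it

    # The evolutionary autotuner of contribution (3) was exposed through a
    # call along these lines (exact signature is an assumption):
    #   matmul.autotune(A, B, cache=True)

Because tensor shapes are inferred from the arguments, the same one-line definition specializes, through the polyhedral JIT compiler and the autotuner, into different CUDA kernels for different input sizes.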

• Published in

  ACM Transactions on Architecture and Code Optimization, Volume 16, Issue 4 (December 2019), 572 pages
  ISSN: 1544-3566
  EISSN: 1544-3973
  DOI: 10.1145/3366460

      Copyright © 2019 ACM


Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 1 February 2019
• Revised: 1 July 2019
• Accepted: 1 August 2019
• Published: 11 October 2019
