Research Article (Open Access)

The Next 700 Accelerated Layers: From Mathematical Expressions of Network Computation Graphs to Accelerated GPU Kernels, Automatically

Published: 11 October 2019

Abstract

Deep learning frameworks automate the deployment, distribution, synchronization, memory allocation, and hardware acceleration of models represented as graphs of computational operators. These operators wrap high-performance libraries such as cuDNN or NNPACK. When the computation does not match any predefined library call, custom operators must be implemented, often at high engineering cost and performance penalty, limiting the pace of innovation. To address this productivity gap, we propose and evaluate: (1) a domain-specific language with a tensor notation close to the mathematics of deep learning; (2) a Just-In-Time optimizing compiler based on the polyhedral framework; (3) carefully coordinated linear optimization and evolutionary algorithms to synthesize high-performance CUDA kernels; (4) the transparent integration of our flow into PyTorch and Caffe2, providing fully automatic synthesis of high-performance GPU kernels from simple tensor algebra. Performance is comparable to, and often exceeds, that of highly tuned libraries.
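
To make the first contribution concrete, the sketch below defines a matrix multiplication in the index notation of the open-source Tensor Comprehensions project this article builds on, and invokes it from PyTorch. The `tensor_comprehensions` Python API shown (`tc.define` and the commented autotuner call) is an assumption based on early releases of that project and may differ in detail; it is an illustration, not the paper's verbatim interface.

    # Illustrative sketch only: the tc.define/autotune API follows early
    # Tensor Comprehensions releases and is an assumption, not a verbatim
    # reproduction of the paper's interface.
    import torch
    import tensor_comprehensions as tc

    # Index notation close to the mathematics: C(m, n) accumulates
    # A(m, k) * B(k, n) over the reduction index k; "+=!" zero-initializes
    # the accumulator before reducing.
    lang = """
    def matmul(float(M, K) A, float(K, N) B) -> (C) {
        C(m, n) +=! A(m, k) * B(k, n)
    }
    """

    matmul = tc.define(lang, name="matmul")
    A = torch.randn(128, 256).cuda()
    B = torch.randn(256, 512).cuda()
    C = matmul(A, B)  # JIT-compiles a CUDA kernel on first call, then runs it

    # The evolutionary autotuner of contribution (3) was exposed through a
    # call along these lines (exact signature is an assumption):
    #   matmul.autotune(A, B, cache=True)

Because tensor shapes are inferred from the arguments, the same one-line definition specializes, through the polyhedral JIT compiler and the autotuner, into different CUDA kernels for different input sizes.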

• Published in

  ACM Transactions on Architecture and Code Optimization, Volume 16, Issue 4 (December 2019), 572 pages
  ISSN: 1544-3566
  EISSN: 1544-3973
  DOI: 10.1145/3366460

      Copyright © 2019 ACM


Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Received: 1 February 2019
• Revised: 1 July 2019
• Accepted: 1 August 2019
• Published: 11 October 2019
