DOI: 10.1145/3316482.3326351
Research article

Optimizing tensor contractions for embedded devices with racetrack memory scratch-pads

Published: 23 June 2019

ABSTRACT

Tensor contraction is a fundamental operation in many algorithms, with applications ranging from quantum chemistry and fluid dynamics to image processing and machine learning. The performance of tensor computations critically depends on the efficient utilization of on-chip memories. In the context of low-power embedded devices, efficient management of the memory space becomes even more crucial in order to meet energy constraints. This work investigates strategies for performance- and energy-efficient tensor contractions on embedded systems, using racetrack memory (RTM)-based scratch-pad memory (SPM). Compiler optimizations such as loop access order and data layout transformations, paired with architectural optimizations such as prefetching and preshifting, are employed to reduce the shifting overhead in RTMs. Experimental results demonstrate that the proposed optimizations improve SPM performance and energy consumption by 24% and 74%, respectively, compared to an iso-capacity SRAM.
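The abstract's claim that loop access order matters for RTMs can be illustrated with a toy shift-cost model (a hypothetical sketch, not the paper's actual simulator or cost function): in racetrack memory, data cells sit along a track and must be shifted under an access port, so the cost of an access sequence depends on the distance between consecutively accessed cells.

```python
# Toy single-port racetrack model (illustrative assumption): accessing a
# cell costs a number of shifts equal to its distance from the current
# port position, and the port stays where it last accessed.

def shift_cost(addresses, port=0):
    """Total shifts needed to serve an access sequence on one track."""
    total = 0
    for a in addresses:
        total += abs(a - port)  # shift the track so cell `a` is at the port
        port = a
    return total

N = 8
# Row-major layout of an N x N matrix on a single track.
addr = lambda i, j: i * N + j

# Column-wise traversal: stride-N jumps between consecutive accesses.
col_order = [addr(i, j) for j in range(N) for i in range(N)]
# Row-wise traversal: unit stride, one shift per access.
row_order = [addr(i, j) for i in range(N) for j in range(N)]

print(shift_cost(col_order), shift_cost(row_order))
```

Under this model, the row-wise order needs only N*N - 1 shifts, while the column-wise order pays a stride-length penalty on every access, which is the kind of gap the paper's loop-order and data-layout transformations target.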


Published in:
LCTES 2019: Proceedings of the 20th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems
June 2019, 218 pages
ISBN: 9781450367240
DOI: 10.1145/3316482

          Copyright © 2019 ACM


Publisher: Association for Computing Machinery, New York, NY, United States



Acceptance rate: 116 of 438 submissions, 26%
