DOI: 10.1145/3316482.3326351
Research article

Optimizing tensor contractions for embedded devices with racetrack memory scratch-pads

Published: 23 June 2019

ABSTRACT

Tensor contraction is a fundamental operation in many algorithms, with applications ranging from quantum chemistry and fluid dynamics to image processing and machine learning. The performance of tensor computations critically depends on the efficient utilization of on-chip memories. In the context of low-power embedded devices, efficient management of the memory space becomes even more crucial in order to meet energy constraints. This work investigates strategies for performance- and energy-efficient tensor contractions on embedded systems, using racetrack memory (RTM)-based scratch-pad memory (SPM). Compiler optimizations such as loop access order and data layout transformations, paired with architectural optimizations such as prefetching and preshifting, are employed to reduce the shifting overhead in RTMs. Experimental results demonstrate that the proposed optimizations improve SPM performance and energy consumption by 24% and 74%, respectively, compared to an iso-capacity SRAM.
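The abstract's claim that loop access order matters for RTMs can be illustrated with a toy shift-cost model (a hypothetical sketch, not the paper's actual simulator or cost function): in racetrack memory, data cells sit along a track and must be shifted under an access port, so the cost of an access sequence depends on the distance between consecutively accessed cells.

```python
# Toy single-port racetrack model (illustrative assumption): accessing a
# cell costs a number of shifts equal to its distance from the current
# port position, and the port stays where it last accessed.

def shift_cost(addresses, port=0):
    """Total shifts needed to serve an access sequence on one track."""
    total = 0
    for a in addresses:
        total += abs(a - port)  # shift the track so cell `a` is at the port
        port = a
    return total

N = 8
# Row-major layout of an N x N matrix on a single track.
addr = lambda i, j: i * N + j

# Column-wise traversal: stride-N jumps between consecutive accesses.
col_order = [addr(i, j) for j in range(N) for i in range(N)]
# Row-wise traversal: unit stride, one shift per access.
row_order = [addr(i, j) for i in range(N) for j in range(N)]

print(shift_cost(col_order), shift_cost(row_order))
```

Under this model, the row-wise order needs only N*N - 1 shifts, while the column-wise order pays a stride-length penalty on every access, which is the kind of gap the paper's loop-order and data-layout transformations target.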


Published in:
LCTES 2019: Proceedings of the 20th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems
June 2019, 218 pages
ISBN: 9781450367240
DOI: 10.1145/3316482

          Copyright © 2019 ACM


Publisher: Association for Computing Machinery, New York, NY, United States



Acceptance rate: 116 of 438 submissions, 26%
