Understanding Energy Aspects of Processing-near-Memory for HPC Workloads

Authors:

Sudhakar Yalamanchili,

Arun F. Rodrigues

MEMSYS '15: Proceedings of the 2015 International Symposium on Memory Systems

Pages 276 - 282

https://doi.org/10.1145/2818950.2818985

Published: 05 October 2015

Abstract

Interest in processing-near-memory (PNM) has been reignited by recent improvements in 3D integration technology. In this work, we analyze the energy consumption characteristics of a system that comprises a conventional processor and a 3D memory stack with fully programmable cores. We construct a high-level analytical energy model based on the underlying architecture and the technology with which each component is built. In preliminary experiments with 11 HPC benchmarks from the Mantevo benchmark suite, we observe that the misses per kilo-instruction (MPKI) of the last-level cache (LLC) is one of the most important characteristics in determining how well an application is suited to PNM execution.
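As a reading aid, the MPKI metric named in the abstract can be sketched in a few lines. This is only the standard definition (misses per 1,000 executed instructions), not the paper's energy model; the function name `mpki` and the counter values below are hypothetical and chosen purely for illustration.

```python
def mpki(llc_misses: int, instructions: int) -> float:
    """Misses per kilo-instruction: cache misses per 1,000 instructions."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return llc_misses * 1000.0 / instructions


# Hypothetical LLC miss and instruction counts (not measurements from the paper).
samples = {
    "app_a": (2_500_000, 1_000_000_000),
    "app_b": (40_000_000, 1_000_000_000),
}

for name, (misses, insts) in samples.items():
    print(f"{name}: {mpki(misses, insts):.2f} MPKI")
```

A higher value means the application reaches main memory more often per unit of work, which is the property the abstract ties to suitability for PNM execution.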


Published In

MEMSYS '15: Proceedings of the 2015 International Symposium on Memory Systems

October 2015

278 pages

ISBN: 9781450336048

DOI: 10.1145/2818950

Copyright © 2015 ACM.

© 2015 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

Sandia National Laboratories, National Nuclear Security Administration
National Science Foundation XPS-1337177

Conference

MEMSYS '15: International Symposium on Memory Systems

October 5 - 8, 2015

Washington, DC, USA

Cited By

Do T, Pottier L, da Silva R, Suter F, Caino-Lores S, Taufer M, Deelman E (2022). Co-scheduling Ensembles of In Situ Workflows. 2022 IEEE/ACM Workshop on Workflows in Support of Large-Scale Science (WORKS), pages 43-51. Online publication date: Nov-2022.
https://doi.org/10.1109/WORKS56498.2022.00011
Oliveira G, Gomez-Luna J, Orosa L, Ghose S, Vijaykumar N, Fernandez I, Sadrosadati M, Mutlu O (2021). DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks. IEEE Access, 9, pages 134457-134502. Online publication date: 2021.
https://doi.org/10.1109/ACCESS.2021.3110993
Kim B, Rhee C (2020). Making Better Use of Processing-in-Memory Through Potential-Based Task Offloading. IEEE Access, 8, pages 61631-61641. Online publication date: 2020.
https://doi.org/10.1109/ACCESS.2020.2983432
Lee J, Kim H (2018). StaleLearn: Learning Acceleration with Asynchronous Synchronization Between Model Replicas on PIM. IEEE Transactions on Computers, 67:6, pages 861-873. Online publication date: 1-Jun-2018.
https://doi.org/10.1109/TC.2017.2780237
Zhang C, Meng T, Sun G (2018). PM3: Power Modeling and Power Management for Processing-in-Memory. 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA), pages 558-570. Online publication date: Feb-2018.
https://doi.org/10.1109/HPCA.2018.00054
Lim H, Park G (2017). Triple Engine Processor (TEP). ACM Transactions on Architecture and Code Optimization, 14:4, pages 1-25. Online publication date: 18-Dec-2017.
https://dl.acm.org/doi/10.1145/3155920
Ibrahim K, Fatollahi-Fard F, Donofrio D, Shalf J, Jacob B (2016). Characterizing the Performance of Hybrid Memory Cube Using ApexMAP Application Probes. Proceedings of the Second International Symposium on Memory Systems, pages 429-436. Online publication date: 3-Oct-2016.
https://dl.acm.org/doi/10.1145/2989081.2989090
Hong B, Kim G, Ahn J, Kwon Y, Kim H, Kim J, Zaks A, Mendelson B, Rauchwerger L, Hwu W (2016). Accelerating Linked-list Traversal Through Near-Data Processing. Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, pages 113-124. Online publication date: 11-Sep-2016.
https://dl.acm.org/doi/10.1145/2967938.2967958
Lee J, Sim J, Kim H (2015). BSSync. Proceedings of the 2015 International Conference on Parallel Architecture and Compilation (PACT), pages 241-252. Online publication date: 18-Oct-2015.
https://dl.acm.org/doi/10.1109/PACT.2015.42
