ABSTRACT
New memory technologies, such as non-volatile memory and stacked memory, have reshaped the memory hierarchies of modern and emerging computer architectures. It is now common to see memories of different types integrated into the same system, an arrangement known as heterogeneous memory. Typically, a heterogeneous memory system consists of a small, fast component and a large, slow component. This encourages a new style of data processing and confronts developers with a new problem: given two memory types, how should applications be redesigned to benefit from this arrangement, and where should each piece of data be placed? Existing methods perform detailed memory access pattern analysis to guide data placement. However, these methods are heavyweight and ignore the interactions between software and hardware.
To address these issues, we develop ProfDP, a lightweight profiler that employs differential data-centric analysis to provide intuitive guidance for data placement in heterogeneous memory. Our evaluation with a number of parallel benchmarks, running on a state-of-the-art emulator and on a real machine with heterogeneous memory, shows that ProfDP guides near-optimal data placement that maximizes performance with minimal programming effort.
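The abstract does not spell out how differential data-centric analysis turns measurements into placement advice, so the following is only an illustrative sketch of the general idea, not ProfDP's actual algorithm. It assumes that each data object has been profiled twice (once with the program running in fast memory, once in slow memory), that the difference in measured memory cost indicates how sensitive the object is to placement, and that a simple greedy fill of the fast memory is acceptable. The names `ObjectProfile`, `rank_by_sensitivity`, and `place` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ObjectProfile:
    """Hypothetical per-object summary from two profiling runs."""
    name: str
    size_bytes: int    # assumed to be positive
    cost_fast: float   # measured memory cost when run in fast memory
    cost_slow: float   # measured memory cost when run in slow memory


def rank_by_sensitivity(profiles):
    """Sort objects by cost saved per byte if moved to fast memory."""
    return sorted(
        profiles,
        key=lambda p: (p.cost_slow - p.cost_fast) / p.size_bytes,
        reverse=True,
    )


def place(profiles, fast_capacity_bytes):
    """Greedily assign the most sensitive objects to fast memory.

    This is an assumed heuristic for illustration only; objects that
    do not benefit, or that no longer fit, stay in slow memory.
    """
    placement = {}
    remaining = fast_capacity_bytes
    for p in rank_by_sensitivity(profiles):
        benefits = p.cost_slow > p.cost_fast
        if benefits and p.size_bytes <= remaining:
            placement[p.name] = "fast"
            remaining -= p.size_bytes
        else:
            placement[p.name] = "slow"
    return placement


if __name__ == "__main__":
    demo = [
        ObjectProfile("A", 100, 10.0, 50.0),
        ObjectProfile("B", 200, 20.0, 30.0),
        ObjectProfile("C", 50, 5.0, 5.0),
    ]
    print(place(demo, fast_capacity_bytes=150))
```

The per-byte ranking reflects the constraint stated in the abstract, that the fast component is small: an object earns its place in fast memory only if it saves more cost per byte than the alternatives.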
ProfDP: A Lightweight Profiler to Guide Data Placement in Heterogeneous Memory Systems