DOI: 10.1145/3205289.3205320
research-article

ProfDP: A Lightweight Profiler to Guide Data Placement in Heterogeneous Memory Systems

Published: 12 June 2018

ABSTRACT

New memory technologies, such as non-volatile memory and stacked memory, have reshaped the memory hierarchies of modern and emerging computer architectures. It is becoming common to see memories of different types integrated into the same system, an arrangement known as heterogeneous memory. Typically, a heterogeneous memory system consists of a small, fast component and a large, slow component. This encourages a new style of data processing and presents developers with a new problem: given two memory types, how should we redesign applications to benefit from this memory arrangement, and how should we decide on an efficient data placement? Existing methods perform detailed memory access pattern analysis to guide data placement. However, these methods are heavyweight and ignore the interactions between software and hardware.

To address these issues, we develop ProfDP, a lightweight profiler that employs differential data-centric analysis to provide intuitive guidance for data placement in heterogeneous memory. Evaluating ProfDP with a number of parallel benchmarks running on a state-of-the-art emulator and on a real machine with heterogeneous memory, we show that it can guide near-optimal data placement to maximize performance with minimal programming effort.

          Published in
          ICS '18: Proceedings of the 2018 International Conference on Supercomputing
          June 2018
          407 pages
          ISBN:9781450357838
          DOI:10.1145/3205289

          Copyright © 2018 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Qualifiers

          • research-article
          • Research
          • Refereed limited

          Acceptance Rates

          Overall Acceptance Rate: 584 of 2,055 submissions, 28%
