DOI: 10.1145/3126908.3126917
research-article

Understanding object-level memory access patterns across the spectrum

Published: 12 November 2017

ABSTRACT

Memory accesses limit the performance and scalability of countless applications. Many design and optimization efforts will benefit from an in-depth understanding of memory access behavior, which is not offered by extant access tracing and profiling methods.

In this paper, we adopt a holistic memory access profiling approach to enable a better understanding of program-system memory interactions. We have developed a two-pass tool adopting fast online and slow offline profiling, with which we have profiled, at the variable/object level, a collection of 38 representative applications spanning major domains (HPC, personal computing, data analytics, AI, graph processing, and datacenter workloads), at varying problem sizes. We have performed detailed result analysis and code examination. Our findings provide new insights into application memory behavior, including insights on per-object access patterns, adoption of data structures, and memory-access changes at different problem sizes. We find that scientific computation applications exhibit distinct behaviors compared to datacenter workloads, motivating separate memory system design/optimizations.

  • Published in

    SC '17: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
    November 2017
    801 pages
    ISBN: 9781450351140
    DOI: 10.1145/3126908
    • General Chair: Bernd Mohr
    • Program Chair: Padma Raghavan

    Copyright © 2017 ACM

    Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Acceptance Rates

    SC '17 Paper Acceptance Rate: 61 of 327 submissions, 19%
    Overall Acceptance Rate: 1,516 of 6,373 submissions, 24%
