Understanding object-level memory access patterns across the spectrum

Authors:
Xu Ji, Tsinghua University
Chao Wang, Oak Ridge National Laboratory
Nosayba El-Sayed, CSAIL, MIT
Xiaosong Ma, Qatar Computing Research Institute
Youngjae Kim, Sogang University
Sudharshan S. Vazhkudai, Oak Ridge National Laboratory
Wei Xue, Tsinghua University
Daniel Sanchez, CSAIL, MIT

SC '17: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, November 2017, Article No. 25, Pages 1–12. https://doi.org/10.1145/3126908.3126917

Published: 12 November 2017

ABSTRACT

Memory accesses limit the performance and scalability of countless applications. Many design and optimization efforts will benefit from an in-depth understanding of memory access behavior, which is not offered by extant access tracing and profiling methods.

In this paper, we adopt a holistic memory access profiling approach to enable a better understanding of program-system memory interactions. We have developed a two-pass tool adopting fast online and slow offline profiling, with which we have profiled, at the variable/object level, a collection of 38 representative applications spanning major domains (HPC, personal computing, data analytics, AI, graph processing, and datacenter workloads), at varying problem sizes. We have performed detailed result analysis and code examination. Our findings provide new insights into application memory behavior, including insights on per-object access patterns, adoption of data structures, and memory-access changes at different problem sizes. We find that scientific computation applications exhibit distinct behaviors compared to datacenter workloads, motivating separate memory system design/optimizations.
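
To make the idea of profiling "at the variable/object level" concrete, the following is a minimal sketch of how raw traced addresses could be attributed to the objects that contain them. It is not the authors' tool: the function name `attribute_accesses`, the object names, and the addresses are all hypothetical, and the sketch assumes the address ranges of objects are known and do not overlap.

```python
from bisect import bisect_right
from collections import Counter


def attribute_accesses(objects, trace):
    """Count accesses per object by mapping each address to its enclosing range.

    objects: list of (name, base_address, size_in_bytes); ranges are assumed
             not to overlap (an assumption of this sketch).
    trace:   iterable of accessed addresses.
    """
    # Sort objects by base address so a binary search finds the candidate owner.
    ordered = sorted(objects, key=lambda obj: obj[1])
    bases = [base for _, base, _ in ordered]
    counts = Counter()
    for addr in trace:
        # Index of the last object whose base address is <= addr.
        i = bisect_right(bases, addr) - 1
        if i >= 0:
            name, base, size = ordered[i]
            if addr < base + size:
                counts[name] += 1
                continue
        # Address falls outside every known object.
        counts["<unattributed>"] += 1
    return counts


# Hypothetical objects and trace, for illustration only.
objects = [("grid", 0x1000, 0x400), ("index", 0x2000, 0x100)]
trace = [0x1000, 0x1008, 0x2004, 0x1010, 0x3000]
per_object = attribute_accesses(objects, trace)
```

With per-object counts like these, access behavior can be summarized per data structure rather than per address, which is the granularity the abstract refers to.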

Published in
SC '17: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2017
801 pages
ISBN: 9781450351140
DOI: 10.1145/3126908
General Chair: Bernd Mohr, Jülich Supercomputing Center, Jülich, Germany
Program Chair: Padma Raghavan, Vanderbilt University, Nashville, TN
Copyright © 2017 ACM
Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
data types and structures
memory profiling
object access patterns
tracing
workload characterization
Qualifiers
- research-article
Acceptance Rates
SC '17 Paper Acceptance Rate: 61 of 327 submissions, 19%
Overall Acceptance Rate: 1,516 of 6,373 submissions, 24%