ABSTRACT
Memory accesses limit the performance and scalability of countless applications. Many design and optimization efforts will benefit from an in-depth understanding of memory access behavior, which is not offered by extant access tracing and profiling methods.
In this paper, we adopt a holistic memory access profiling approach to enable a better understanding of program-system memory interactions. We have developed a two-pass tool that combines fast online profiling with slow offline profiling, and used it to profile, at the variable/object level, a collection of 38 representative applications spanning major domains (HPC, personal computing, data analytics, AI, graph processing, and datacenter workloads) at varying problem sizes. We have performed detailed result analysis and code examination. Our findings provide new insights into application memory behavior, including per-object access patterns, adoption of data structures, and changes in memory access behavior across problem sizes. We find that scientific computation applications behave distinctly from datacenter workloads, which motivates separate memory system designs and optimizations for the two.