research-article

Memory management in NUMA multicore systems: trapped between cache contention and interconnect overhead

Authors:
Zoltan Majo, ETH Zurich, Zurich, Switzerland
Thomas R. Gross, ETH Zurich, Zurich, Switzerland

ISMM '11: Proceedings of the international symposium on Memory management, June 2011, Pages 11–20
https://doi.org/10.1145/1993478.1993481

Published: 04 June 2011

ABSTRACT

Multiprocessors based on processors with multiple cores usually include a non-uniform memory architecture (NUMA); even current 2-processor systems with 8 cores exhibit non-uniform memory access times. As the cores of a processor share a common cache, the issues of memory management and process mapping must be revisited. We find that optimizing only for data locality can counteract the benefits of cache contention avoidance and vice versa. Therefore, system software must take both data locality and cache contention into account to achieve good performance, and memory management cannot be decoupled from process scheduling. We present a detailed analysis of a commercially available NUMA-multicore architecture, the Intel Nehalem. We describe two scheduling algorithms: maximum-local, which optimizes for maximum data locality, and its extension, N-MASS, which reduces data locality to avoid the performance degradation caused by cache contention. N-MASS is fine-tuned to support memory management on NUMA-multicores and improves performance by up to 32%, and by 7% on average, over the default setup in current Linux implementations.
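
The trade-off between data locality and cache contention can be illustrated with a small toy model. The sketch below is not the paper's maximum-local or N-MASS algorithm, whose details the abstract does not give. It is a minimal illustration under stated assumptions: each process has a known home node and a single hypothetical `cache_pressure` number, the total number of processes does not exceed the number of cores, and the names `Proc`, `locality_first`, `contention_aware`, and `threshold` are invented for this example. The first function places processes purely by locality; the second starts from that placement and moves processes off their home node while doing so narrows the cache-pressure gap between nodes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proc:
    name: str
    home_node: int         # node whose memory holds the process's data (assumed known)
    cache_pressure: float  # hypothetical estimate of shared-cache demand


def locality_first(procs, num_nodes, cores_per_node):
    """Locality only: run each process on its home node while a core is free."""
    mapping = {n: [] for n in range(num_nodes)}
    for p in procs:
        node = p.home_node
        if len(mapping[node]) >= cores_per_node:
            # Home node is full: fall back to the node with the fewest processes.
            node = min(mapping, key=lambda n: len(mapping[n]))
        mapping[node].append(p)
    return mapping


def pressure(mapping, node):
    """Total cache pressure of the processes mapped to one node."""
    return sum(p.cache_pressure for p in mapping[node])


def contention_aware(procs, num_nodes, cores_per_node, threshold):
    """Start from the locality-first mapping, then give up some locality
    while the cache-pressure gap between nodes exceeds `threshold`."""
    mapping = locality_first(procs, num_nodes, cores_per_node)
    while True:
        hot = max(mapping, key=lambda n: pressure(mapping, n))
        cold = min(mapping, key=lambda n: pressure(mapping, n))
        gap = pressure(mapping, hot) - pressure(mapping, cold)
        if gap <= threshold or len(mapping[cold]) >= cores_per_node:
            return mapping
        # Only moves that strictly narrow the gap between hot and cold qualify.
        movable = [p for p in mapping[hot] if 0 < p.cache_pressure < gap]
        if not movable:
            return mapping
        # Pick the process whose move leaves the smallest remaining gap.
        victim = min(movable, key=lambda p: abs(gap - 2 * p.cache_pressure))
        mapping[hot].remove(victim)
        mapping[cold].append(victim)


procs = [Proc("A", 0, 10.0), Proc("B", 0, 6.0), Proc("C", 0, 2.0), Proc("D", 1, 1.0)]
local = locality_first(procs, num_nodes=2, cores_per_node=4)
balanced = contention_aware(procs, num_nodes=2, cores_per_node=4, threshold=2.0)
remote = [p.name for n, ps in balanced.items() for p in ps if p.home_node != n]
```

In this toy run the locality-first mapping leaves three cache-hungry processes sharing node 0, while the contention-aware variant ends with two processes running away from their data. That is the tension the abstract describes: less cache contention is bought with more remote memory accesses, so a real scheduler would also have to weigh the cost of each remote placement.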

Index Terms

1. General and reference
  1. Cross-computing tools and techniques
    1. Measurement
2. Software and its engineering
  1. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        Process management
        Scheduling

Published in
ISMM '11: Proceedings of the international symposium on Memory management
June 2011
148 pages
ISBN: 9781450302630
DOI: 10.1145/1993478
General Chair: Hans-J. Boehm, HP Labs
Program Chair: David Bacon, IBM T.J. Watson Research

ACM SIGPLAN Notices Volume 46, Issue 11
ISMM '11
November 2011
135 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/2076022
Copyright © 2011 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
memory allocation
multicore processors
numa
shared resource contention
Acceptance Rates
Overall Acceptance Rate: 72 of 156 submissions, 46%