ABSTRACT
Multiprocessors based on processors with multiple cores usually have a non-uniform memory architecture (NUMA); even current 2-processor systems with 8 cores exhibit non-uniform memory access times. Because the cores of a processor share a common cache, the issues of memory management and process mapping must be revisited. We find that optimizing only for data locality can counteract the benefits of cache contention avoidance and vice versa. Therefore, system software must take both data locality and cache contention into account to achieve good performance, and memory management cannot be decoupled from process scheduling. We present a detailed analysis of a commercially available NUMA-multicore architecture, the Intel Nehalem. We describe two scheduling algorithms: maximum-local, which optimizes for maximum data locality, and its extension, N-MASS, which reduces data locality to avoid the performance degradation caused by cache contention. N-MASS is fine-tuned to support memory management on NUMA-multicores and improves performance by up to 32%, and by 7% on average, over the default setup in current Linux implementations.
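The trade-off between the two policies can be illustrated with a small sketch. This is not the authors' implementation of maximum-local or N-MASS; the abstract does not give their details. The `Proc` fields, the `cache_pressure` and `remote_penalty` metrics, the greedy loop, and the `threshold` value are all hypothetical, chosen only to show a locality-only mapping next to one that gives up some locality to balance pressure on the shared caches.

```python
from dataclasses import dataclass


@dataclass
class Proc:
    name: str
    home_node: int         # node assumed to hold most of the process's memory
    cache_pressure: float  # hypothetical metric, e.g. last-level cache misses per 1000 instructions
    remote_penalty: float  # hypothetical slowdown factor when run away from home_node


def maximum_local(procs):
    """Locality-only mapping: every process runs on its home node."""
    return {p.name: p.home_node for p in procs}


def contention_aware(procs, num_nodes=2, threshold=10.0):
    """Start from the local mapping, then trade locality for cache balance.

    While the cache-pressure gap between the most and least loaded node
    exceeds `threshold` (an illustrative value), move the process with the
    smallest remote-access penalty from the hot node to the cold node.
    """
    mapping = maximum_local(procs)
    by_name = {p.name: p for p in procs}
    while True:
        # Recompute the summed cache pressure per node for the current mapping.
        loads = [0.0] * num_nodes
        for name, node in mapping.items():
            loads[node] += by_name[name].cache_pressure
        hot = max(range(num_nodes), key=lambda n: loads[n])
        cold = min(range(num_nodes), key=lambda n: loads[n])
        if loads[hot] - loads[cold] <= threshold:
            return mapping
        # Only consider moves that leave the cold node below the old hot load.
        movable = [
            by_name[name]
            for name, node in mapping.items()
            if node == hot
            and by_name[name].cache_pressure > 0
            and loads[cold] + by_name[name].cache_pressure < loads[hot]
        ]
        if not movable:
            return mapping
        victim = min(movable, key=lambda p: p.remote_penalty)
        mapping[victim.name] = cold


if __name__ == "__main__":
    procs = [
        Proc("A", home_node=0, cache_pressure=30.0, remote_penalty=1.40),
        Proc("B", home_node=0, cache_pressure=25.0, remote_penalty=1.05),
        Proc("C", home_node=0, cache_pressure=2.0, remote_penalty=1.20),
        Proc("D", home_node=1, cache_pressure=1.0, remote_penalty=1.10),
    ]
    print(maximum_local(procs))
    print(contention_aware(procs))
```

In this made-up workload the locality-only mapping places A, B, and C on node 0, so the two cache-hungry processes share one cache. The contention-aware variant moves B, the process that suffers least from remote memory accesses, to node 1 and then stops because the remaining pressure gap is below the threshold.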
Memory management in NUMA multicore systems: trapped between cache contention and interconnect overhead. In ISMM '11.