DOI: 10.1145/1993478.1993481
research-article

Memory management in NUMA multicore systems: trapped between cache contention and interconnect overhead

Published: 04 June 2011

ABSTRACT

Multiprocessors based on processors with multiple cores usually include a non-uniform memory architecture (NUMA); even current 2-processor systems with 8 cores exhibit non-uniform memory access times. As the cores of a processor share a common cache, the issues of memory management and process mapping must be revisited. We find that optimizing only for data locality can counteract the benefits of cache contention avoidance and vice versa. Therefore, system software must take both data locality and cache contention into account to achieve good performance, and memory management cannot be decoupled from process scheduling. We present a detailed analysis of a commercially available NUMA-multicore architecture, the Intel Nehalem. We describe two scheduling algorithms: maximum-local, which optimizes for maximum data locality, and its extension, N-MASS, which reduces data locality to avoid the performance degradation caused by cache contention. N-MASS is fine-tuned to support memory management on NUMA-multicores and improves performance by up to 32%, and by 7% on average, over the default setup in current Linux implementations.
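
The trade-off between data locality and cache contention described above can be illustrated with a small toy model. The sketch below is not the paper's maximum-local or N-MASS algorithm, whose details the abstract does not give. It assumes each process has a known home NUMA node and a single hypothetical `cache_pressure` score; the names `Proc`, `contention_aware`, and `threshold` are likewise invented for illustration. A locality-first mapping places every process on its home node, and a greedy refinement then gives up some locality by moving processes off the most heavily loaded shared cache.

```python
from dataclasses import dataclass


@dataclass
class Proc:
    name: str
    home_node: int         # node assumed to hold most of the process's memory
    cache_pressure: float  # hypothetical score for how hard it hits the shared cache


def maximum_local(procs):
    """Locality-first mapping: every process runs on its home node."""
    return {p.name: p.home_node for p in procs}


def node_pressure(procs, mapping, num_nodes):
    """Sum the cache-pressure scores of the processes mapped to each node."""
    totals = [0.0] * num_nodes
    for p in procs:
        totals[mapping[p.name]] += p.cache_pressure
    return totals


def contention_aware(procs, num_nodes, threshold):
    """Start from the locality-first mapping, then trade locality for balance.

    While the gap between the most and least loaded node exceeds `threshold`,
    move the process whose migration shrinks that gap the most. Stop when the
    gap is small enough or no single move strictly reduces it.
    """
    mapping = maximum_local(procs)
    while True:
        totals = node_pressure(procs, mapping, num_nodes)
        hot = max(range(num_nodes), key=totals.__getitem__)
        cold = min(range(num_nodes), key=totals.__getitem__)
        gap = totals[hot] - totals[cold]
        if gap <= threshold:
            return mapping
        # Moving pressure x from hot to cold changes the gap to |gap - 2x|.
        candidates = [
            p for p in procs
            if mapping[p.name] == hot and abs(gap - 2 * p.cache_pressure) < gap
        ]
        if not candidates:
            return mapping
        best = min(candidates, key=lambda p: abs(gap - 2 * p.cache_pressure))
        mapping[best.name] = cold  # this process now accesses its memory remotely
```

Calling `contention_aware(procs, num_nodes=2, threshold=4.0)` on a list of `Proc` objects returns a mapping from process name to node. A smaller `threshold` balances cache load more aggressively at the cost of more remote memory accesses, which is the tension the abstract identifies.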

    Published in

      ISMM '11: Proceedings of the international symposium on Memory management
      June 2011
      148 pages
      ISBN: 9781450302630
      DOI: 10.1145/1993478

      ACM SIGPLAN Notices, Volume 46, Issue 11
      ISMM '11
      November 2011
      135 pages
      ISSN: 0362-1340
      EISSN: 1558-1160
      DOI: 10.1145/2076022

      Copyright © 2011 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 72 of 156 submissions, 46%
