research-article

MN-MATE: Elastic Resource Management of Manycores and a Hybrid Memory Hierarchy for a Cloud Node

Published: 03 August 2015

Abstract

The recent advent of manycore systems increases the need for a larger yet faster memory hierarchy. Emerging next-generation memories such as on-chip DRAM and nonvolatile memory (NVRAM) are promising candidates to replace DRAM-only main memory. Combined with the manycore trend, they offer an opportunity to rethink conventional resource management for a single cloud node with a memory hierarchy. To mitigate the energy and memory problems, we propose MN-MATE, an elastic resource management architecture for a single cloud node with manycores, on-chip DRAM, and large off-chip DRAM and NVRAM. In MN-MATE, the hypervisor places consolidated VMs and balances memory among them. Based on monitored information about its allocated memory, a guest OS co-schedules tasks that access different types of memory with complementary access intensity. Polymorphic management of the DRAM hierarchy improves average memory access speed inside each guest OS. Using an NVRAM-aware data placement policy and a hybrid page cache, a guest OS reduces energy consumption with a small performance loss. A new lightweight kernel is developed to reduce guest OS overhead for scientific applications. Experimental results show that our techniques in the MN-MATE platform improve system performance and reduce energy consumption.



    • Published in

      ACM Journal on Emerging Technologies in Computing Systems  Volume 12, Issue 1
      July 2015
      210 pages
      ISSN:1550-4832
      EISSN:1550-4840
      DOI:10.1145/2810396

      Copyright © 2015 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 3 August 2015
      • Accepted: 1 December 2014
      • Revised: 1 July 2014
      • Received: 1 March 2014


      Qualifiers

      • research-article
      • Research
      • Refereed
