Abstract
We propose a novel online method of identifying the preferred NUMA nodes for objects with negligible overhead during the garbage collection time as well as object allocation time. Since the number of CPUs (or NUMA nodes) is increasing recently, it is critical for the memory manager of the runtime environment of an object-oriented language to exploit the low latency of local memory for high performance. To locate the CPU of a thread that frequently accesses an object, prior research uses the runtime information about memory accesses as sampled by the hardware. However, the overhead of this approach is high for a garbage collector.
Our approach uses the information about which thread can exclusively access an object, or the Dominant Thread (DoT). The dominant thread of an object is the thread that often most accesses an object so that we do not require memory access samples. Our NUMA-aware GC performs DoT based object copying, which copies each live object to the CPU where the dominant thread was last dispatched before GC. The dominant thread information is known from the thread stack and from objects that are locked or reserved by threads and is propagated in the object reference graph.
We demonstrate that our approach can improve the performance of benchmark programs such as SPECpower ssj2008, SPECjbb2005, and SPECjvm2008.We prototyped a NUMAaware memory manager on a modified version of IBM Java VM and tested it on a cc-NUMA POWER6 machine with eight NUMA nodes. Our NUMA-aware GC achieved performance improvements up to 14.3% and 2.0% on average over a JVM that only used the NUMA-aware allocator. The total improvement using both the NUMA-aware allocator and GC is up to 53.1% and 10.8% on average.
- Advanced Micro Devices, Inc. Performance guidelines for AMD AthlonTM64 and AMD OpteronTMccNUMA multiprocessor systems, 2006.Google Scholar
- David F. Bacon, Ravi Konuru, Chet Murthy, and Mauricio Serrano. Thin locks: featherweight synchronization for Java. In PLDI '98: Proc. ACM SIGPLAN 1998 Conf. on Programming language design and implementation, pages 258--268, 1998. Google ScholarDigital Library
- BEA Systems, Inc. BEA JRockit, r27.5, commandline reference. http://e-docs.bea.com/jrockit/jrdocs/pdf/refman.pdf, January 2008.Google Scholar
- Martin J. Bligh, Matt Dobson, Darren Hart, and Gerrit Huizenga. Linux on NUMA systems. In Proceedings of the Linux Symposium, pages 295--306, 2004.Google Scholar
- B. Brock, G. Carpenter, E. Chiprout, E. Elnozahy, M. Dean, D. Glasco, J. Peterson, R. Rajamony, F. Rawson, R. Rockhold, and A. Zimmerman. Windows NT in a ccNUMA system. In WINSYM'99: Proceedings of the 3rd conference on USENIX Windows NT Symposium, pages 7--7, Berkeley, CA, USA, 1999. USENIX Association. Google ScholarDigital Library
- Myra Cohen, Shiu Beng Kooi, and Witawas Srisa-an. Clustering the heap in multi-threaded applications for improved garbage collection. In GECCO '06: Proceedings of the 8th annual conference on Genetic and evolutionary computation, pages 1901--1908, New York, NY, USA, 2006. ACM. ISBN 1-59593-186-4. doi: http://doi.acm.org/10.1145/1143997.1144314. Google ScholarDigital Library
- IBM Corp. IBM @server pSeries 650 performance and tuning. http://www-03.ibm.com/systems/resources/systems_p_hardware_whitepapers_p650_perf.pdf, 2002.Google Scholar
- IBM Corp. AIX 5L differences guide version 5.3 edition. http://www.redbooks.ibm.com/redpieces/pdfs/sg247463.pdf, 2004.Google Scholar
- David Dagastine and Paul Hohensee. High performance Java technology in a multi-core world. http://developers.sun.com/learning/javaoneonline/2007/pdf/TS-2885.pdf, 2007.Google Scholar
- Robert Dimpsey, Rajiv Arora, and Kean Kuiper. Java server performance: a case study of building efficient, stable JVMs. IBM Systems Journal, 39(1):151--174, 2000. Google ScholarDigital Library
- Tamar Domani, Gal Goldshtein, Elliot K. Kolodner, Ethan Lewis, Erez Petrank, and Dafna Sheinwald. Thread-local heaps for Java. In ISMM '02: Proceedings of the 3rd International Symposium on Memory Management, pages 76--87, New York, NY, USA, 2002. ACM. ISBN 1-58113-539-4. doi: http://doi.acm.org/10.1145/512429.512439. Google ScholarDigital Library
- Nikola Grcevski. Effective method for Java lock reservation for Java virtual machines that have cooperative multithreading. In 6th Workshop on Compiler-Driven Performance, Associated with CASCON2007, 2007.Google Scholar
- Nikola Grcevski, Allan Kielstra, Kevin Stoodley, Mark G. Stoodley, and Vijay Sundaresan. Java just-intime compiler and virtual machine improvements for server and middleware applications. In Proc. the 3rd Virtual Machine Research and Technology Symposium, pages 151--162, 2004. Google ScholarDigital Library
- Martin Hirzel. Data layouts for object-oriented programs. In SIGMETRICS '07: Proceedings of the 2007 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pages 265--276, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-639-4. doi: http://doi.acm.org/10.1145/1254882.1254915. Google ScholarDigital Library
- IBM Corp. IBM developer kit and runtime environment, Java technology edition, version 6 diagnostics guide. http://download.boulder.ibm.com/ibmdl/pub/software/dw/jdk/diagnosis/diag60.pdf.Google Scholar
- IBM Corp. WebSphere Application Server Network Deployment, version 6.1, tuning Java virtual machines. http://publib.boulder.ibm.com/infocenter/wasinfo/v6r1/index.jsp?topic=/com.ibm.websphere.nd.doc/info/ae/ae/tprf_tunejvm_v61.html.Google Scholar
- Intel Corp. Intel 64 and IA-32 architectures optimization reference manual, 2009.Google Scholar
- Kiyokuni Kawachiya, Akira Koseki, and Tamiya Onodera. Lock reservation: Java locks can mostly do without atomic operations. In Proc. ACM SIGPLAN Conf. on Object-Oriented Programming Systems, Languages and Applications, pages 130--141, 2002. Google ScholarDigital Library
- Hung Q. Le,William J. Starke, J. Stephen Fields, Francis P. O'Connell, Dung Q. Nguyen, Bruce J. Ronchetti, Wolfram M. Sauer, Eric M. Schwarz, and Michael T. (Mike) Vaden. IBM POWER6 microarchitecture. IBM J. Res. Dev., 51(6):639--662, 2007. ISSN 0018-8646. Google ScholarDigital Library
- Kyungwoo Lee and Samuel P. Midkiff. A two-phase escape analysis for parallel Java programs. In PACT'06: Proceedings of the 15th International Conference on Parallel Architectures and Compilation Techniques, pages 53--62, New York, NY, USA, 2006. ACM. ISBN 1-59593-264-X. doi: http://doi.acm.org/10.1145/1152154.1152166. Google ScholarDigital Library
- Kyungwoo Lee, Xing Fang, and Samuel P. Midkiff. Practical escape analyses: how good are they? In VEE '07: Proceedings of the 3rd International Conference on Virtual Execution Environments, pages 180--190, New York, NY, USA, 2007. ACM. ISBN 978-1-59593-630-1. doi: http://doi.acm.org/10.1145/1254810.1254836. Google ScholarDigital Library
- Henry Lieberman and Carl Hewitt. A real-time garbage collector based on the lifetimes of objects. Commun. ACM, 26(6):419--429, 1983. ISSN 0001-0782. doi: http://doi.acm.org/10.1145/358141.358147. Google ScholarDigital Library
- Jaydeep Marathe and Frank Mueller. Hardware profileguided automatic page placement for ccnuma systems. In PPoPP '06: Proceedings of the eleventh ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 90--99, New York, NY, USA, 2006. ACM. ISBN 1-59593-189-9. doi: http://doi.acm.org/10.1145/1122971.1122987. Google ScholarDigital Library
- Takeshi Ogasawara, Hideaki Komatsu, and Toshio Nakatani. TO-lock: Removing lock overhead using the owners' temporal locality. In PACT '04: Proc. 13th International Conf. on Parallel Architectures and Compilation Techniques, pages 255--266, 2004. Google ScholarDigital Library
- Mattias Persson. Java technology, IBM style: Garbage collection policies, part 1. http://http://www.ibm.com/developerworks//library/j-ibmjava2/, 2006.Google Scholar
- Kenneth Russell and David Detlefs. Eliminating synchronization-related atomic operations with biased locking and bulk rebiasing. In Proc. ACM SIGPLAN Conf. on Object-Oriented Programming Systems, Languages and Applications, pages 263--272, 2006. Google ScholarDigital Library
- Ryan Sciampacone. A production world view of garbage collection for the Java ME and SE spaces. http://www.mm-net.org.uk/school/programme.html.Google Scholar
- Yefim Shuf, Manish Gupta, Rajesh Bordawekar, and Jaswinder Pal Singh. Exploiting prolific types for memory management and optimizations. In POPL '02: Proceedings of the 29th ACM SIGPLAN-SIGACT symposium on Principles of programming languages, pages 295--306, New York, NY, USA, 2002. ACM. ISBN 1-58113-450-9. doi: http://doi.acm.org/10.1145/503272.503300. Google ScholarDigital Library
- Bjarne Steensgaard. Thread-specific heaps for multithreaded programs. In ISMM '00: Proceedings of the 2nd International Symposium on Memory Management, pages 18--24, New York, NY, USA, 2000. ACM. ISBN 1-58113-263-8. doi: http://doi.acm.org/10.1145/362422.362432. Google ScholarDigital Library
- Sun Microsystems, Inc. Ergonomics in the 5.0 JavaTMvirtual machine. http://java.sun.com/docs/hotspot/gc5.0/ergo5.html.Google Scholar
- Sun Microsystems, Inc. SolarisTMmemory placement optimization and Sun FireTMservers. http://www.sun.com/servers/wp/docs/mpo_v7_CUSTOMER.pdf, 2003.Google Scholar
- Sun Microsystems, Inc. Java SE 6 HotSpotTMvirtual machine garbage collection tuning. http://java.sun.com/javase/technologies/hotspot/gc/gc_tuning_6.html, 2008.Google Scholar
- Sun Microsystems, Inc. Sun SPARC R enterprise T5440 server architecture, 2008.Google Scholar
- Christian Terboven, Dieter an Mey, Dirk Schmidl, Henry Jin, and Thomas Reichstein. Data and thread affinity in openmp programs. In MAW '08: Proceedings of the 2008 workshop on Memory access on future processors, pages 377--384, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-091-3. doi: http://doi.acm.org/10.1145/1366219.1366222. Google ScholarDigital Library
- Mustafa M. Tikir and Jeffery K. Hollingsworth. NUMA-aware Java heaps for server applications. In IPDPS '05: Proceedings of the 19th IEEE International Parallel and Distributed Processing Symposium (IPDPS'05) - Papers, page 108.2, Washington, DC, USA, 2005. IEEE Computer Society. ISBN 0-7695-2312-9. doi: http://dx.doi.org/10.1109/IPDPS.2005.299. Google ScholarDigital Library
- Mustafa M. Tikir and Jeffrey K. Hollingsworth. Using hardware counters to automatically improve memory performance. In SC '04: Proceedings of the 2004 ACM/IEEE conference on Supercomputing, page 46, Washington, DC, USA, 2004. IEEE Computer Society. ISBN 0-7695-2153-3. doi: http://dx.doi.org/10.1109/SC.2004.64. Google ScholarDigital Library
- David Ungar. Generation scavenging: A nondisruptive high performance storage reclamation algorithm. In SDE 1: Proceedings of the first ACM SIGSOFT/SIGPLAN software engineering symposium on Practical software development environments, pages 157--167, New York, NY, USA, 1984. ACM. ISBN 0-89791-131-8. doi: http://doi.acm.org/10.1145/800020. 808261. Google ScholarDigital Library
Index Terms
- NUMA-aware memory manager with dominant-thread-based copying GC
Recommendations
NUMA-aware memory manager with dominant-thread-based copying GC
OOPSLA '09: Proceedings of the 24th ACM SIGPLAN conference on Object oriented programming systems languages and applicationsWe propose a novel online method of identifying the preferred NUMA nodes for objects with negligible overhead during the garbage collection time as well as object allocation time. Since the number of CPUs (or NUMA nodes) is increasing recently, it is ...
The pauseless GC algorithm
VEE '05: Proceedings of the 1st ACM/USENIX international conference on Virtual execution environmentsModern transactional response-time sensitive applications have run into practical limits on the size of garbage collected heaps. The heap can only grow until GC pauses exceed the response-time limits. Sustainable, scalable concurrent collection has ...
Topology-Aware Parallelism for NUMA Copying Collectors
LCPC 2015: Revised Selected Papers of the 28th International Workshop on Languages and Compilers for Parallel Computing - Volume 9519NUMA-aware parallel algorithms in runtime systems attempt to improve locality by allocating memory from local NUMA nodes. Researchers have suggested that the garbage collector should profile memory access patterns or use object locality heuristics to ...
Comments