research-article
Open Access

Profiling a warehouse-scale computer

Published: 13 June 2015

Abstract

With the increasing prevalence of warehouse-scale (WSC) and cloud computing, understanding the interactions of server applications with the underlying microarchitecture becomes ever more important in order to extract maximum performance out of server hardware. To aid such understanding, this paper presents a detailed microarchitectural analysis of live datacenter jobs, measured on more than 20,000 Google machines over a three-year period, and comprising thousands of different applications.

We first find that WSC workloads are extremely diverse, breeding the need for architectures that can tolerate application variability without performance loss. However, some patterns emerge, offering opportunities for co-optimization of hardware and software. For example, we identify common building blocks in the lower levels of the software stack. This "datacenter tax" can comprise nearly 30% of cycles across jobs running in the fleet, which makes its constituents prime candidates for hardware specialization in future server systems-on-chips. We also uncover opportunities for classic microarchitectural optimizations for server processors, especially in the cache hierarchy. Typical workloads place significant stress on instruction caches and are more sensitive to memory latency than to memory bandwidth. They also stall cores often, but compute heavily in bursts. These observations motivate several interesting directions for future warehouse-scale computers.

Published in

ACM SIGARCH Computer Architecture News, Volume 43, Issue 3S (ISCA'15), June 2015, 745 pages. ISSN: 0163-5964. DOI: 10.1145/2872887

ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture, June 2015, 768 pages. ISBN: 9781450334020. DOI: 10.1145/2749469

Copyright © 2015 Owner/Author

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery, New York, NY, United States
