Abstract
With the increasing prevalence of warehouse-scale computing (WSC) and cloud computing, understanding how server applications interact with the underlying microarchitecture becomes ever more important for extracting maximum performance from server hardware. To aid such understanding, this paper presents a detailed microarchitectural analysis of live datacenter jobs, measured on more than 20,000 Google machines over a three-year period and comprising thousands of different applications.
We first find that WSC workloads are extremely diverse, creating a need for architectures that can tolerate application variability without performance loss. However, some patterns emerge, offering opportunities for co-optimization of hardware and software. For example, we identify common building blocks in the lower levels of the software stack. This "datacenter tax" can account for nearly 30% of cycles across jobs running in the fleet, which makes its constituents prime candidates for hardware specialization in future server systems-on-chip. We also uncover opportunities for classic microarchitectural optimizations for server processors, especially in the cache hierarchy. Typical workloads place significant stress on instruction caches and favor low memory latency over high memory bandwidth. They also stall cores often, but compute heavily in bursts. These observations motivate several interesting directions for future warehouse-scale computers.
Profiling a warehouse-scale computer
ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture