Profiling a warehouse-scale computer

Authors:
Svilen Kanev (Harvard University)
Juan Pablo Darago (Universidad de Buenos Aires)
Kim Hazelwood (Yahoo Labs)
Parthasarathy Ranganathan (Google)
Tipp Moseley (Google)
Gu-Yeon Wei (Harvard University)
David Brooks (Harvard University)

ACM SIGARCH Computer Architecture News, Volume 43, Issue 3S, June 2015, pp. 158–169. https://doi.org/10.1145/2872887.2750392

Published: 13 June 2015

Abstract

With the increasing prevalence of warehouse-scale (WSC) and cloud computing, understanding the interactions of server applications with the underlying microarchitecture becomes ever more important in order to extract maximum performance out of server hardware. To aid such understanding, this paper presents a detailed microarchitectural analysis of live datacenter jobs, measured on more than 20,000 Google machines over a three-year period, and comprising thousands of different applications.

We first find that WSC workloads are extremely diverse, breeding the need for architectures that can tolerate application variability without performance loss. However, some patterns emerge, offering opportunities for co-optimization of hardware and software. For example, we identify common building blocks in the lower levels of the software stack. This "datacenter tax" can comprise nearly 30% of cycles across jobs running in the fleet, which makes its constituents prime candidates for hardware specialization in future server systems-on-chips. We also uncover opportunities for classic microarchitectural optimizations for server processors, especially in the cache hierarchy. Typical workloads place significant stress on instruction caches and prefer memory latency over bandwidth. They also stall cores often, but compute heavily in bursts. These observations motivate several interesting directions for future warehouse-scale computers.


Published in
ACM SIGARCH Computer Architecture News, Volume 43, Issue 3S (ISCA '15), June 2015, 745 pages. ISSN: 0163-5964. DOI: 10.1145/2872887. Editor: Doug DeGroot.
ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture, June 2015, 768 pages. ISBN: 9781450334020. DOI: 10.1145/2749469. General Chair: Debbie Marr (Intel). Program Chair: David Albonesi (Cornell).
Copyright © 2015 Owner/Author
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.
Publisher: Association for Computing Machinery, New York, NY, United States