Understanding network failures in data centers: measurement, analysis, and implications

Authors:

Nachiappan Nagappan

ACM SIGCOMM Computer Communication Review, Volume 41, Issue 4

Pages 350 - 361

https://doi.org/10.1145/2043164.2018477

Published: 15 August 2011

Abstract

We present the first large-scale analysis of failures in a data center network. Through our analysis, we seek to answer several fundamental questions: which devices/links are most unreliable, what causes failures, how do failures impact network traffic, and how effective is network redundancy? We answer these questions using multiple data sources commonly collected by network operators. The key findings of our study are that (1) data center networks show high reliability, (2) commodity switches such as ToRs and AggS are highly reliable, (3) load balancers dominate in terms of failure occurrences with many short-lived software-related faults, (4) failures have the potential to cause loss of many small packets such as keep-alive messages and ACKs, and (5) network redundancy is only 40% effective in reducing the median impact of failure.

Index Terms

Understanding network failures in data centers: measurement, analysis, and implications
1. Networks
  1. Network services
    1. Network management

Information & Contributors

Information

Published In

ACM SIGCOMM Computer Communication Review Volume 41, Issue 4

SIGCOMM '11

August 2011

480 pages

ISSN:0146-4833

DOI:10.1145/2043164

SIGCOMM '11: Proceedings of the ACM SIGCOMM 2011 conference
August 2011
502 pages
ISBN:9781450307970
DOI:10.1145/2018436
General Chairs: Srinivasan Keshav (University of Waterloo, Canada), Jörg Liebeherr (University of Toronto, Canada)
Program Chairs: John Byers (Boston University, USA), Jeffrey Mogul (HP Labs, USA)

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 15 August 2011

Published in SIGCOMM-CCR Volume 41, Issue 4

Qualifiers

Research-article

Bibliometrics

Total citations: 687
Total downloads: 5,508
Downloads (last 12 months): 836
Downloads (last 6 weeks): 65

Reflects downloads up to 15 Feb 2025
