research-article

Understanding network failures in data centers: measurement, analysis, and implications

Published: 15 August 2011

Abstract

We present the first large-scale analysis of failures in a data center network. Through our analysis, we seek to answer several fundamental questions: which devices/links are most unreliable, what causes failures, how do failures impact network traffic, and how effective is network redundancy? We answer these questions using multiple data sources commonly collected by network operators. The key findings of our study are that (1) data center networks show high reliability, (2) commodity switches such as ToRs and AggS are highly reliable, (3) load balancers dominate in terms of failure occurrences with many short-lived software-related faults, (4) failures have the potential to cause loss of many small packets such as keep-alive messages and ACKs, and (5) network redundancy is only 40% effective in reducing the median impact of failure.

    Published In

    ACM SIGCOMM Computer Communication Review, Volume 41, Issue 4
    SIGCOMM '11
    August 2011
    480 pages
    ISSN: 0146-4833
    DOI: 10.1145/2043164

    SIGCOMM '11: Proceedings of the ACM SIGCOMM 2011 conference
    August 2011
    502 pages
    ISBN: 9781450307970
    DOI: 10.1145/2018436
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. data centers
    2. network reliability

    Qualifiers

    • Research-article
