research-article

Safe and effective fine-grained TCP retransmissions for datacenter communication

Authors:
Vijay Vasudevan

Carnegie Mellon University, Pittsburgh, PA, USA

Amar Phanishayee

Carnegie Mellon University, Pittsburgh, PA, USA

Hiral Shah

Carnegie Mellon University, Pittsburgh, PA, USA

Elie Krevat

Carnegie Mellon University, Pittsburgh, PA, USA

David G. Andersen

Carnegie Mellon University, Pittsburgh, PA, USA

Gregory R. Ganger

Carnegie Mellon University, Pittsburgh, PA, USA

Garth A. Gibson

Carnegie Mellon University and Panasas, Inc., Pittsburgh, PA, USA

Brian Mueller

Panasas, Inc., Pittsburgh, PA, USA


ACM SIGCOMM Computer Communication Review, Volume 39, Issue 4, October 2009, pp. 303–314. https://doi.org/10.1145/1594977.1592604

Published: 16 August 2009

Abstract

This paper presents a practical solution to a problem facing high-fan-in, high-bandwidth synchronized TCP workloads in datacenter Ethernets: the TCP incast problem. In these networks, receivers can experience a drastic reduction in application throughput when simultaneously requesting data from many servers using TCP. Inbound data overfills small switch buffers, leading to TCP timeouts lasting hundreds of milliseconds. For many datacenter workloads that have a barrier synchronization requirement (e.g., filesystem reads and parallel data-intensive queries), throughput is reduced by up to 90%. For latency-sensitive applications, TCP timeouts in the datacenter impose delays of hundreds of milliseconds in networks whose round-trip times are measured in microseconds.

Our practical solution uses high-resolution timers to enable microsecond-granularity TCP timeouts. We demonstrate that this technique is effective in avoiding TCP incast collapse in simulation and in real-world experiments. We show that eliminating the minimum retransmission timeout bound is safe for all environments, including the wide-area.
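
To make the role of the minimum retransmission timeout concrete, the following is a minimal sketch of the standard smoothed-RTT timeout estimator (in the style of RFC 6298) with an optional lower bound. It is not the authors' implementation. The class name, the 200 ms floor, and the RTT samples are assumptions chosen for illustration, and the clock-granularity term of the RFC is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Optional

ALPHA = 1 / 8   # gain for the smoothed RTT (standard TCP estimator)
BETA = 1 / 4    # gain for the RTT variation
K = 4           # multiplier applied to the RTT variation


@dataclass
class RtoEstimator:
    """Hypothetical helper: tracks SRTT/RTTVAR and applies a minimum RTO."""
    min_rto_s: float                 # lower bound in seconds; 0.0 removes it
    srtt: Optional[float] = None
    rttvar: Optional[float] = None

    def update(self, rtt_s: float) -> float:
        """Fold in one RTT sample (seconds) and return the resulting RTO."""
        if self.srtt is None:
            # First measurement initializes both state variables.
            self.srtt = rtt_s
            self.rttvar = rtt_s / 2
        else:
            # RTTVAR is updated first, using the previous SRTT.
            self.rttvar = (1 - BETA) * self.rttvar + BETA * abs(self.srtt - rtt_s)
            self.srtt = (1 - ALPHA) * self.srtt + ALPHA * rtt_s
        return max(self.min_rto_s, self.srtt + K * self.rttvar)


# Assumed datacenter RTT samples on the order of 100 microseconds.
samples = [100e-6, 120e-6, 90e-6, 110e-6]

coarse = RtoEstimator(min_rto_s=0.200)  # assumed 200 ms floor
fine = RtoEstimator(min_rto_s=0.0)      # no floor, microsecond granularity
for sample in samples:
    rto_coarse = coarse.update(sample)
    rto_fine = fine.update(sample)

print(f"RTO with floor:    {rto_coarse * 1e3:.3f} ms")
print(f"RTO without floor: {rto_fine * 1e3:.3f} ms")
```

With the floor in place, the timeout is pinned at the floor no matter how small the measured round-trip times are. Without it, the timeout tracks the measured RTT and stays well under a millisecond for these samples.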

Published in
ACM SIGCOMM Computer Communication Review Volume 39, Issue 4
SIGCOMM '09
October 2009
325 pages
ISSN:0146-4833
DOI:10.1145/1594977
SIGCOMM '09: Proceedings of the ACM SIGCOMM 2009 conference on Data communication
August 2009
340 pages
ISBN:9781605585949
DOI:10.1145/1592568
General Chairs: Pablo Rodriguez (Telefonica Research, Spain), Ernst Biersack (Eurecom, France)
Program Chairs: Konstantina Papagiannaki (Intel Labs Pittsburgh, USA), Luigi Rizzo (Università di Pisa, Italy)
Copyright © 2009 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
datacenter networks
incast
performance
throughput