Abstract
This paper presents a practical solution to a problem facing high-fan-in, high-bandwidth synchronized TCP workloads in datacenter Ethernets: the TCP incast problem. In these networks, receivers can experience a drastic reduction in application throughput when simultaneously requesting data from many servers using TCP. Inbound data overfills small switch buffers, leading to TCP timeouts lasting hundreds of milliseconds. For many datacenter workloads that have a barrier synchronization requirement (e.g., filesystem reads and parallel data-intensive queries), throughput is reduced by up to 90%. For latency-sensitive applications, TCP timeouts in the datacenter impose delays of hundreds of milliseconds in networks whose round-trip times are measured in microseconds.
Our practical solution uses high-resolution timers to enable microsecond-granularity TCP timeouts. We demonstrate that this technique is effective in avoiding TCP incast collapse in simulation and in real-world experiments. We show that eliminating the minimum retransmission timeout bound is safe for all environments, including the wide area.
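To make the role of the minimum retransmission timeout (RTO) bound concrete, the following is a minimal sketch rather than the paper's implementation. It assumes the standard smoothed-RTT estimator (gains 1/8 and 1/4, RTO = SRTT + 4 x RTTVAR) and a hypothetical 200 ms lower bound chosen to match the "hundreds of milliseconds" scale mentioned in the abstract. The class name `RtoEstimator` and its fields are illustrative only.

```python
from dataclasses import dataclass
from typing import Optional

ALPHA = 1 / 8  # gain for the smoothed RTT (standard estimator, assumed here)
BETA = 1 / 4   # gain for the RTT variance (standard estimator, assumed here)
K = 4          # variance multiplier in RTO = SRTT + K * RTTVAR


@dataclass
class RtoEstimator:
    """Hypothetical RTO estimator working in microseconds."""
    min_rto_us: float = 0.0          # 0.0 models "no minimum RTO bound"
    srtt_us: Optional[float] = None  # smoothed RTT, unset until first sample
    rttvar_us: float = 0.0           # smoothed RTT variation

    def update(self, sample_us: float) -> float:
        """Fold in one RTT sample and return the resulting RTO."""
        if self.srtt_us is None:
            # First measurement initialises both estimates.
            self.srtt_us = sample_us
            self.rttvar_us = sample_us / 2
        else:
            # Variance is updated first, using the previous smoothed RTT.
            self.rttvar_us = ((1 - BETA) * self.rttvar_us
                              + BETA * abs(self.srtt_us - sample_us))
            self.srtt_us = (1 - ALPHA) * self.srtt_us + ALPHA * sample_us
        return max(self.min_rto_us, self.srtt_us + K * self.rttvar_us)


if __name__ == "__main__":
    fine = RtoEstimator(min_rto_us=0.0)          # microsecond-granularity timeout
    coarse = RtoEstimator(min_rto_us=200_000.0)  # assumed 200 ms lower bound
    for rtt_us in (100.0, 100.0, 100.0):         # assumed 100 us datacenter RTT
        print(fine.update(rtt_us), coarse.update(rtt_us))
```

With a 100 microsecond round-trip time, the unbounded estimator yields a timeout of a few hundred microseconds, while the bounded one stays pinned at the 200 ms floor, roughly three orders of magnitude longer than the path needs.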
Safe and effective fine-grained TCP retransmissions for datacenter communication
SIGCOMM '09: Proceedings of the ACM SIGCOMM 2009 conference on Data communication