research-article

Safe and effective fine-grained TCP retransmissions for datacenter communication

Published: 16 August 2009

Abstract

This paper presents a practical solution to a problem facing high-fan-in, high-bandwidth synchronized TCP workloads in datacenter Ethernets: the TCP incast problem. In these networks, receivers can experience a drastic reduction in application throughput when simultaneously requesting data from many servers using TCP. Inbound data overfills small switch buffers, leading to TCP timeouts lasting hundreds of milliseconds. For many datacenter workloads that have a barrier synchronization requirement (e.g., filesystem reads and parallel data-intensive queries), throughput is reduced by up to 90%. For latency-sensitive applications, TCP timeouts in the datacenter impose delays of hundreds of milliseconds in networks whose round-trip times are measured in microseconds.

Our practical solution uses high-resolution timers to enable microsecond-granularity TCP timeouts. We demonstrate that this technique is effective in avoiding TCP incast collapse in simulation and in real-world experiments. We show that eliminating the minimum retransmission timeout bound is safe for all environments, including the wide-area.
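
To make the role of the minimum retransmission timeout concrete, the following is a minimal sketch of the standard smoothed-RTT timeout estimator (the RFC 6298 formulation, with gains 1/8 and 1/4 and a variance multiplier of 4), run once with a coarse lower bound and once without one. It is not the paper's implementation. The 200 ms floor, the 1 ms and 1 µs timer granularities, and the 100 µs round-trip time are illustrative assumptions chosen to match the orders of magnitude the abstract mentions, and the names `RtoEstimator` and `final_rto` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class RtoEstimator:
    """Smoothed-RTT retransmission timer in the style of RFC 6298."""

    min_rto: float      # lower bound on the timeout, in seconds (assumed value)
    granularity: float  # timer tick G, in seconds (assumed value)
    srtt: Optional[float] = None
    rttvar: float = 0.0

    def update(self, rtt: float) -> float:
        """Fold in one RTT sample (seconds) and return the resulting RTO."""
        if self.srtt is None:
            # First measurement: initialise both estimators from the sample.
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # Update the variance first, using the previous SRTT.
            self.rttvar = 0.75 * self.rttvar + 0.25 * abs(self.srtt - rtt)
            self.srtt = 0.875 * self.srtt + 0.125 * rtt
        rto = self.srtt + max(self.granularity, 4 * self.rttvar)
        # The floor is what dominates when RTTs are tiny.
        return max(self.min_rto, rto)


def final_rto(estimator: RtoEstimator, samples: Iterable[float]) -> float:
    """Feed every sample to the estimator and return the last RTO."""
    rto = estimator.min_rto
    for rtt in samples:
        rto = estimator.update(rtt)
    return rto


# Hypothetical datacenter path: twenty samples of a steady 100 microsecond RTT.
samples = [100e-6] * 20

# Coarse timer: 200 ms floor, 1 ms tick (assumed, typical-order values).
coarse_rto = final_rto(RtoEstimator(min_rto=0.2, granularity=1e-3), samples)

# Fine-grained timer: no floor, 1 microsecond tick (assumed values).
fine_rto = final_rto(RtoEstimator(min_rto=0.0, granularity=1e-6), samples)

print(f"coarse RTO: {coarse_rto * 1e3:.3f} ms")
print(f"fine RTO:   {fine_rto * 1e3:.3f} ms")
```

With the floor in place the timeout stays pinned at the floor no matter how small the measured round-trip time is; without it, and with a fine-grained tick, the timeout settles near the round-trip time itself.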


          Published in

          ACM SIGCOMM Computer Communication Review, Volume 39, Issue 4 (SIGCOMM '09), October 2009, 325 pages
          ISSN: 0146-4833
          DOI: 10.1145/1594977

          SIGCOMM '09: Proceedings of the ACM SIGCOMM 2009 conference on Data communication, August 2009, 340 pages
          ISBN: 9781605585949
          DOI: 10.1145/1592568

          Copyright © 2009 ACM


          Publisher

          Association for Computing Machinery

          New York, NY, United States
