DOI: 10.1145/3341302.3342073
research-article

Zooming in on wide-area latencies to a global cloud provider

Published: 19 August 2019

ABSTRACT

The network communications between the cloud and the client have become the weak link for global cloud services that aim to provide low-latency services to their clients. In this paper, we first characterize WAN latency from the viewpoint of a large cloud provider, Azure, whose network edges serve hundreds of billions of TCP connections a day across hundreds of locations worldwide. In particular, we focus on instances of latency degradation and design a tool, BlameIt, that enables cloud operators to localize the cause (i.e., the faulty AS) of such degradation. BlameIt first uses passive diagnosis, based on measurements of existing connections between clients and cloud locations, to localize the cause to one of the cloud, middle, or client segments. It then invokes selective active probing (within a probing budget) to localize the cause more precisely. We validate BlameIt by comparing its automatic fault localization results with those arrived at manually by network engineers, and observe that BlameIt correctly localized the problem in all 88 incidents. Further, BlameIt issues 72× fewer active probes than a solution relying on active probing alone, and is deployed in production at Azure.
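The two-phase idea in the abstract (coarse passive localization, then budgeted active probing) can be illustrated with a toy sketch. This is not the paper's algorithm: the abstract does not give one. The names `Sample`, `blame`, `pick_probe_targets`, the fixed RTT threshold, the fraction `tau`, and the cloud-then-middle-then-client elimination order are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One passively observed connection (hypothetical record layout)."""
    cloud: str      # cloud location that served the connection
    middle: str     # identifier for the middle segment between cloud and client
    client: str     # client-side network
    rtt_ms: float   # RTT measured on the existing connection


def bad_fraction(samples, key, rtt_threshold_ms):
    """Per group (as defined by `key`), the fraction of samples above the RTT threshold."""
    total, bad = defaultdict(int), defaultdict(int)
    for s in samples:
        k = key(s)
        total[k] += 1
        bad[k] += s.rtt_ms > rtt_threshold_ms
    return {k: bad[k] / total[k] for k in total}


def blame(samples, rtt_threshold_ms=100.0, tau=0.8):
    """Assign each bad sample to 'cloud', 'middle', or 'client'.

    Assumed rule: blame the widest group in which at least `tau` of the
    connections are bad, checking cloud first, then middle, else client.
    """
    cloud_bad = bad_fraction(samples, lambda s: s.cloud, rtt_threshold_ms)
    middle_bad = bad_fraction(samples, lambda s: (s.cloud, s.middle), rtt_threshold_ms)
    verdicts = []
    for s in samples:
        if s.rtt_ms <= rtt_threshold_ms:
            continue  # healthy connection, nothing to blame
        if cloud_bad[s.cloud] >= tau:
            verdicts.append((s, "cloud"))
        elif middle_bad[(s.cloud, s.middle)] >= tau:
            verdicts.append((s, "middle"))
        else:
            verdicts.append((s, "client"))
    return verdicts


def pick_probe_targets(verdicts, budget):
    """Pick at most `budget` blamed middle segments to probe, most affected first."""
    impact = defaultdict(int)
    for s, segment in verdicts:
        if segment == "middle":
            impact[(s.cloud, s.middle)] += 1
    ranked = sorted(impact, key=lambda k: (-impact[k], k))
    return ranked[:budget]
```

Calling `blame(samples)` on a list of `Sample` records yields per-connection verdicts, and `pick_probe_targets(verdicts, budget)` returns the `(cloud, middle)` pairs that would receive active probes. Ranking by the number of affected connections is one simple way to spend a limited probing budget; other prioritizations are equally plausible.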

Supplemental Material

p104-jin.mp4 (mp4, 937.5 MB)

            Reviews

            Mariam Kiran

            The authors measure wide area network (WAN) latency from the viewpoint of a large cloud provider, Azure, by tracking the round-trip time (RTT) of transmission control protocol (TCP) connections. With their tool, BlameIt, they aim to detect faults and diagnose where in the WAN the problem lies. Locating a problem in a large WAN is a pressing challenge in networks today: as networks grow and become more complex, it is difficult to find where and why problems occur, such as data not reaching its destination or packets being lost along the way. This paper presents a passive measurement tool to help localize certain problems in a WAN.

            The paper first presents a measurement analysis of various aspects of the Azure network. It describes the datasets collected and how the authors deduce (1) the countries in which bad RTTs are commonly recorded, (2) how long these bad connections last, and (3) how they affect clients. It then goes on to present BlameIt. The tool passively records RTT-relevant data to understand where the problems are happening: the client, middle, or cloud segment. A number of issues are recognized; for example, middle-segment problems dominate in India, China, and Brazil. The authors also found that the US has more directly related high RTTs than the rest of the world. Working with autonomous system (AS) and border gateway protocol (BGP) information, the tool combines passive measurements (TCP handshake RTTs) with selective active measurements (traceroutes) to localize latency degradation between client and cloud locations.

            The paper is easy to read, and it's exciting to see how Azure measures and determines where bad performance is happening on its network. In other networks, tools such as perfSONAR and loss measurements are used, and it would be interesting to see how Google Cloud Platform (GCP) and Amazon Web Services (AWS) measure their network performance. This paper is a good read for those working to improve network performance using machine learning.

            Published in

              SIGCOMM '19: Proceedings of the ACM Special Interest Group on Data Communication
              August 2019
              526 pages
              ISBN: 9781450359566
              DOI: 10.1145/3341302

              Copyright © 2019 ACM

              Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Acceptance Rates

              Overall Acceptance Rate: 554 of 3,547 submissions, 16%
