ABSTRACT
Network communication between the cloud and the client has become the weak link for global cloud services that aim to provide low-latency service to their clients. In this paper, we first characterize WAN latency from the viewpoint of a large cloud provider, Azure, whose network edges serve hundreds of billions of TCP connections a day across hundreds of locations worldwide. In particular, we focus on instances of latency degradation and design a tool, BlameIt, that enables cloud operators to localize the cause (i.e., the faulty AS) of such degradation. BlameIt first uses passive diagnosis, based on measurements of existing connections between clients and cloud locations, to localize the cause to one of the cloud, middle, or client segments. It then invokes selective active probing (within a probing budget) to localize the cause more precisely. We validate BlameIt by comparing its automatic fault localization results with those reached manually by network engineers, and observe that BlameIt correctly localized the problem in all 88 incidents. Further, BlameIt issues 72X fewer active probes than a solution relying on active probing alone, and is deployed in production at Azure.
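The two-phase idea in the abstract (coarse passive localization, then budgeted active probing) can be illustrated with a minimal sketch. This is not BlameIt's implementation: the `Sample` fields, the per-connection `expected_ms` baseline, the `BAD_FRACTION` threshold, and the rule for ranking probe targets are all assumptions made here for illustration only.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One passively observed connection (all fields are hypothetical)."""
    cloud: str          # cloud edge location that served the connection
    middle: str         # identifier of the middle segment, e.g. an AS path
    client: str         # client AS
    rtt_ms: float       # observed round-trip time
    expected_ms: float  # assumed baseline RTT for this connection


# Assumed threshold: a segment is suspected when at least this fraction
# of the connections crossing it are degraded.
BAD_FRACTION = 0.8


def is_bad(sample):
    return sample.rtt_ms > sample.expected_ms


def bad_fraction(samples, key):
    """Fraction of degraded samples per group, grouped by key(sample)."""
    total, bad = defaultdict(int), defaultdict(int)
    for s in samples:
        total[key(s)] += 1
        bad[key(s)] += is_bad(s)
    return {k: bad[k] / total[k] for k in total}


def blame_segments(samples):
    """Phase 1 (passive): assign each degraded sample to a coarse segment."""
    cloud_frac = bad_fraction(samples, lambda s: s.cloud)
    middle_frac = bad_fraction(samples, lambda s: s.middle)
    blamed = []
    for s in samples:
        if not is_bad(s):
            continue
        if cloud_frac[s.cloud] >= BAD_FRACTION:
            blamed.append((s, "cloud"))
        elif middle_frac[s.middle] >= BAD_FRACTION:
            blamed.append((s, "middle"))
        else:
            blamed.append((s, "client"))
    return blamed


def pick_probe_targets(blamed, budget):
    """Phase 2 (active): choose at most `budget` middle segments to probe,
    ranked by how many degraded connections they carry (assumed ranking)."""
    counts = Counter(s.middle for s, seg in blamed if seg == "middle")
    return [middle for middle, _ in counts.most_common(budget)]
```

In this sketch the cloud segment is checked before the middle segment, so a widespread problem at one cloud location is not misattributed to every path leaving it. Only the middle segments that survive that check compete for the probing budget.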