ABSTRACT
Network reliability is critical for large cloud and online service providers like Microsoft. Our network is large, heterogeneous, and complex, and it undergoes constant churn. In such an environment, even small issues triggered by device failures, buggy device software, configuration errors, unproven management tools, and unavoidable human errors can quickly cause large outages. A promising way to minimize such outages is to proactively validate all network operations in a high-fidelity network emulator before they are carried out in production. To this end, we present CrystalNet, a cloud-scale, high-fidelity network emulator. It runs real network device firmware in a network of containers and virtual machines, loaded with production configurations. Network engineers can use the same management tools and methods to interact with the emulated network as they do with a production network. CrystalNet can handle heterogeneous device firmware and can scale to emulate thousands of network devices in a matter of minutes. To reduce resource consumption, it carefully selects the emulation boundary while ensuring that network changes propagate correctly. Microsoft's network engineers use CrystalNet on a daily basis to test planned network operations. Our experience shows that CrystalNet enables operators to detect many issues that could trigger significant outages.
Index Terms
- CrystalNet: Faithfully Emulating Large Production Networks