Abstract
Uninterrupted uptime is a critical aspect of Virtual Machines (VMs) offered by cloud hosting providers. Google's VMs run on top of rapidly changing infrastructure: we regularly update hardware and host software, and we must quickly respond to failing hardware. Frequent change is critical to both development velocity (deploying new versions of services and infrastructure) and the ability to respond rapidly to defects, including critical security fixes. Typically these updates would be disruptive, resulting in VM termination or restart. In this paper we present how we use VM live migration at scale to eliminate this disruption with minimal impact to the guest, performing over 1,000,000 migrations monthly in our production fleet, with a median blackout of 50 ms and a 99th-percentile blackout of 300 ms.
VM Live Migration At Scale
VEE '18: Proceedings of the 14th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments