ABSTRACT
Cloud computing with large-scale datacenters provides great convenience and cost-efficiency for end users. However, the resource utilization of cloud datacenters is very low, which wastes a huge amount of infrastructure investment and operating energy. To improve resource utilization, cloud providers usually co-locate workloads of different types on shared resources. However, resource sharing means that quality of service (QoS) can no longer be guaranteed. In fact, improving resource utilization (IRU) while guaranteeing QoS in the cloud has been a dilemma, which we name the IRU-QoS curse. To tackle this issue, characterizing workloads from real production cloud computing platforms is extremely important.
In this work, we analyze a recently released 24-hour trace dataset from a production cluster at Alibaba. We reveal three key findings that differ significantly from those drawn from the Google trace. First, each online service runs in a container, while batch jobs run on physical servers. The two are managed concurrently by two different schedulers and co-located on the same servers, which we call semi-containerized co-location. Second, batch instances largely use the spare resources that containers have reserved but not used, which shows the elasticity of resource allocation in the Alibaba cluster. Moreover, through resource overprovisioning, overbooking, and overcommitment, the cluster's resource allocation achieves high elasticity. Third, because high elasticity may hurt the performance of co-located online services, the Alibaba cluster bounds the resources used by batch tasks to guarantee steady performance for both online services and batch tasks, which we call the plasticity of resource allocation.
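The interplay between the second and third findings can be illustrated with a minimal sketch. It is not the Alibaba scheduler's actual policy: the `Server` fields, the function name, and the 0.5 cap are all hypothetical values chosen for illustration. The sketch only shows the idea that batch work may expand into whatever online containers leave idle (elasticity), up to a fixed bound (plasticity).

```python
from dataclasses import dataclass


@dataclass
class Server:
    capacity: float        # total CPU cores on the server (hypothetical unit)
    reserved: float        # cores reserved by online-service containers
    used_by_online: float  # cores the containers actually use
    batch_cap: float       # assumed bound on batch usage, as a fraction of capacity


def cores_available_to_batch(s: Server) -> float:
    """Return the cores batch tasks may use on this server under the toy model."""
    # Elasticity: batch may take everything the online services are not using,
    # including cores that containers reserved but left idle.
    idle = s.capacity - s.used_by_online
    # Plasticity: batch usage is bounded no matter how much is idle.
    bound = s.batch_cap * s.capacity
    return max(0.0, min(idle, bound))


# Lightly loaded online services: the bound is the limiting factor.
light = Server(capacity=64.0, reserved=48.0, used_by_online=16.0, batch_cap=0.5)
# Heavily loaded online services: the idle capacity is the limiting factor.
heavy = Server(capacity=64.0, reserved=48.0, used_by_online=40.0, batch_cap=0.5)

print(cores_available_to_batch(light))
print(cores_available_to_batch(heavy))
```

In the first case the bound limits batch usage even though more capacity is idle; in the second, batch usage shrinks as online demand grows.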
The Elasticity and Plasticity in Semi-Containerized Co-locating Cloud Workload: a View from Alibaba Trace