DOI: 10.1145/3267809.3267830

The Elasticity and Plasticity in Semi-Containerized Co-locating Cloud Workload: a View from Alibaba Trace

Published: 11 October 2018

ABSTRACT

Cloud computing with large-scale datacenters provides great convenience and cost efficiency for end users. However, the resource utilization of cloud datacenters is very low, which wastes a huge amount of infrastructure investment and operating energy. To improve resource utilization, cloud providers usually co-locate workloads of different types on shared resources. However, resource sharing leaves the quality of service (QoS) unguaranteed. In fact, improving resource utilization (IRU) while guaranteeing QoS in the cloud has long been a dilemma, which we name the IRU-QoS curse. To tackle this issue, characterizing the workloads of real production cloud platforms is extremely important.

In this work, we analyze a recently released 24-hour trace dataset from a production cluster in Alibaba. We reveal three key findings that differ significantly from those of the Google trace. First, each online service runs in a container while batch jobs run directly on physical servers, and the two are concurrently managed by different schedulers and co-located on the same servers, an arrangement we call semi-containerized co-location. Second, batch instances largely use the spare resources that containers reserve but do not use; through resource overprovisioning, overbooking, and overcommitment, the resource allocation of the Alibaba cluster thus achieves high elasticity. Third, since high elasticity may hurt the performance of co-located online services, the Alibaba cluster sets bounds on the resources used by batch tasks to guarantee steady performance for both online services and batch tasks, a property we call plasticity of resource allocation.
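To make these two notions concrete, below is a minimal Python sketch of how a scheduler could lend reserved-but-idle container resources to batch tasks (elasticity) while a fixed cap protects co-located online services (plasticity). This is our illustration only, not code from the paper or from Alibaba's schedulers; all names and numbers are hypothetical.

```python
# Illustrative sketch only; not the Alibaba scheduler. All names and
# numbers here are hypothetical.
from dataclasses import dataclass

@dataclass
class Machine:
    cpu_capacity: float        # total CPU cores on the server
    container_reserved: float  # cores reserved by online-service containers
    container_used: float      # cores those containers actually use right now
    batch_bound: float         # plasticity: hard cap on cores batch tasks may take

def batch_allocatable(m: Machine) -> float:
    """Cores batch tasks may borrow: idle reserved capacity (elasticity via
    overcommitment) plus unreserved headroom, capped by the plasticity bound."""
    idle_reserved = m.container_reserved - m.container_used
    headroom = m.cpu_capacity - m.container_reserved
    return min(idle_reserved + headroom, m.batch_bound)

# Example: a 64-core server whose containers reserve 48 cores but use only 20.
# 44 cores sit idle, yet batch tasks are granted at most the 32-core bound.
m = Machine(cpu_capacity=64, container_reserved=48, container_used=20, batch_bound=32)
print(batch_allocatable(m))  # prints 32
```

Note that the bound, not the amount of momentarily idle capacity, determines the final grant; this is what keeps online-service performance steady even when batch demand spikes.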


Published in

SoCC '18: Proceedings of the ACM Symposium on Cloud Computing
October 2018, 546 pages
ISBN: 9781450360111
DOI: 10.1145/3267809

Copyright © 2018 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery, New York, NY, United States


Qualifiers

research-article (refereed limited)

Acceptance Rates

Overall acceptance rate: 169 of 722 submissions, 23%
