DOI: 10.1145/3267809.3267838
Research article

Kairos: Preemptive Data Center Scheduling Without Runtime Estimates

Published: 11 October 2018

ABSTRACT

The vast majority of data center schedulers use task runtime estimates to improve the quality of their scheduling decisions. Knowledge about runtimes allows the schedulers, among other things, to achieve better load balance and to avoid head-of-line blocking. Obtaining accurate runtime estimates is, however, far from trivial, and erroneous estimates lead to sub-optimal scheduling decisions. Techniques to mitigate the effect of inaccurate estimates have shown some success, but the fundamental problem remains.

This paper presents Kairos, a novel data center scheduler that assumes no prior information on task runtimes. Kairos introduces a distributed approximation of the Least Attained Service (LAS) scheduling policy. Kairos consists of a centralized scheduler and per-node schedulers. The per-node schedulers implement LAS for tasks on their node, using preemption as necessary to avoid head-of-line blocking. The centralized scheduler distributes tasks among nodes in a manner that balances the load and imposes on each node a workload in which LAS provides favorable performance.
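The per-node policy is easy to make concrete. Below is a minimal sketch of a Least Attained Service queue; it is not taken from the Kairos implementation, and the class name, the quantum-driven pick/charge interface, and the use of a heap are all illustrative assumptions. The node always runs the task that has received the least service so far, and a newly arrived task, having attained zero service, implicitly preempts whatever is running.

```python
import heapq

class LASNodeScheduler:
    """Minimal sketch of a per-node Least Attained Service (LAS) queue.

    The task with the least attained service (CPU time received so far)
    runs next. A new arrival starts at zero attained service, so it sorts
    ahead of every task that has already run: arrivals implicitly preempt,
    which avoids head-of-line blocking without runtime estimates.
    """

    def __init__(self):
        self._heap = []  # min-heap of (attained_service, seq, task_id)
        self._seq = 0    # arrival order; breaks ties among equally served tasks

    def submit(self, task_id):
        heapq.heappush(self._heap, (0.0, self._seq, task_id))
        self._seq += 1

    def pick(self):
        """Task to run for the next quantum, or None if the node is idle."""
        return self._heap[0][2] if self._heap else None

    def charge(self, used_seconds, done=False):
        """Account the service the picked task just received; requeue unless done.

        Assumes no arrival between pick() and charge(); a real scheduler
        would re-pick (i.e., preempt) when a new task arrives mid-quantum.
        """
        attained, seq, task_id = heapq.heappop(self._heap)
        if not done:
            heapq.heappush(self._heap, (attained + used_seconds, seq, task_id))
```

A node agent would loop over pick(), run the chosen task for one quantum (Kairos preempts via containers; that mechanism is abstracted away here), then call charge(). When all queued tasks have equal attained service, LAS degenerates into round-robin, which is one reason the centralized scheduler must impose on each node a workload in which LAS performs well, as noted above.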

We have implemented Kairos in YARN. We compare its performance against the YARN FIFO scheduler and Big-C, an open-source state-of-the-art YARN-based scheduler that also uses preemption. Compared to YARN FIFO, Kairos reduces the median job completion time by 73% and the 99th percentile by 30%. Compared to Big-C, the improvements are 37% for the median and 57% for the 99th percentile. We evaluate Kairos at scale by implementing it in the Eagle simulator and comparing its performance against Eagle. Kairos improves the 99th percentile of short job completion times by up to 55% for the Google trace and 85% for the Yahoo trace.

References

  1. O. Alipourfard, H. H. Liu, J. Chen, S. Venkataraman, M. Yu, and M. Zhang. CherryPick: Adaptively unearthing the best cloud configurations for big data analytics. In Proceedings of the 14th USENIX Conference on Networked Systems Design and Implementation, NSDI'17, pages 469--482, Berkeley, CA, USA, 2017. USENIX Association.
  2. G. Ananthanarayanan, A. Ghodsi, S. Shenker, and I. Stoica. Effective straggler mitigation: Attack of the clones. In Proceedings of the 10th USENIX Conference on Networked Systems Design and Implementation, NSDI'13, pages 185--198, Berkeley, CA, USA, 2013. USENIX Association.
  3. N. Avrahami and Y. Azar. Minimizing total flow time and total completion time with immediate dispatching. In Proceedings of the 15th Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA'03, pages 11--18, New York, NY, USA, 2003. ACM.
  4. E. Boutin, J. Ekanayake, W. Lin, B. Shi, J. Zhou, Z. Qian, M. Wu, and L. Zhou. Apollo: Scalable and coordinated scheduling for cloud-scale computing. In 11th USENIX Symposium on Operating Systems Design and Implementation, OSDI'14, pages 285--300, Broomfield, CO, Oct. 2014. USENIX Association.
  5. W. Chen, J. Rao, and X. Zhou. Preemptive, low latency datacenter scheduling via lightweight virtualization. In Proceedings of the 2017 USENIX Annual Technical Conference, USENIX ATC'17, pages 251--263, Santa Clara, CA, 2017.
  6. Y. Chen, S. Alspaugh, and R. Katz. Interactive analytical processing in big data systems: A cross-industry study of MapReduce workloads. Proceedings of the VLDB Endowment, 5(12):1802--1813, Aug. 2012.
  7. Y. Chen, A. Ganapathi, R. Griffith, and R. Katz. The case for evaluating MapReduce performance using workload suites. In Proceedings of the 2011 IEEE 19th Annual International Symposium on Modelling, Analysis, and Simulation of Computer and Telecommunication Systems, MASCOTS'11, pages 390--399, Washington, DC, USA, 2011. IEEE Computer Society.
  8. E. Coppa and I. Finocchi. On data skewness, stragglers, and MapReduce progress indicators. In Proceedings of the 6th ACM Symposium on Cloud Computing, SoCC'15, pages 139--152, New York, NY, USA, 2015. ACM.
  9. C. Curino, D. E. Difallah, C. Douglas, S. Krishnan, R. Ramakrishnan, and S. Rao. Reservation-based scheduling: If you're late don't blame us! In Proceedings of the ACM Symposium on Cloud Computing, SoCC'14, pages 2:1--2:14, New York, NY, USA, 2014. ACM.
  10. P. Delgado, D. Didona, F. Dinu, and W. Zwaenepoel. Job-aware scheduling in Eagle: Divide and stick to your probes. In Proceedings of the Seventh ACM Symposium on Cloud Computing, SoCC'16, 2016.
  11. P. Delgado, F. Dinu, A.-M. Kermarrec, and W. Zwaenepoel. Hawk: Hybrid datacenter scheduling. In Proceedings of the 2015 USENIX Annual Technical Conference, USENIX ATC'15, pages 499--510. USENIX Association, July 2015.
  12. C. Delimitrou and C. Kozyrakis. Quasar: Resource-efficient and QoS-aware cluster management. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS'14, pages 127--144, New York, NY, USA, 2014. ACM.
  13. D. G. Down and R. Wu. Multi-layered round robin routing for parallel servers. Queueing Systems, 53(4):177--188, Aug. 2006.
  14. A. D. Ferguson, P. Bodik, S. Kandula, E. Boutin, and R. Fonseca. Jockey: Guaranteed job latency in data parallel clusters. In Proceedings of the 7th ACM European Conference on Computer Systems, EuroSys'12, pages 99--112, New York, NY, USA, 2012. ACM.
  15. P. Garefalakis, K. Karanasos, P. Pietzuch, A. Suresh, and S. Rao. Medea: Scheduling of long running applications in shared production clusters. In Proceedings of the 13th European Conference on Computer Systems, EuroSys'18, pages 4:1--4:13, 2018.
  16. B. Ghit and D. H. J. Epema. Tyrex: Size-based resource allocation in MapReduce frameworks. In IEEE/ACM 16th International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2016, Cartagena, Colombia, May 16-19, 2016, pages 11--20, 2016.
  17. A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica. Dominant resource fairness: Fair allocation of multiple resource types. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, NSDI'11, pages 323--336, Berkeley, CA, USA, 2011. USENIX Association.
  18. I. Gog, M. Schwarzkopf, A. Gleave, R. M. N. Watson, and S. Hand. Firmament: Fast, centralized cluster scheduling at scale. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation, OSDI'16, 2016.
  19. R. Grandl, G. Ananthanarayanan, S. Kandula, S. Rao, and A. Akella. Multi-resource packing for cluster schedulers. ACM SIGCOMM Computer Communication Review, 44(4):455--466, Aug. 2014.
  20. R. Grandl, M. Chowdhury, A. Akella, and G. Ananthanarayanan. Altruistic scheduling in multi-resource clusters. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI'16, pages 65--80, Berkeley, CA, USA, 2016. USENIX Association.
  21. R. Grandl, S. Kandula, S. Rao, A. Akella, and J. Kulkarni. GRAPHENE: Packing and dependency-aware scheduling for data-parallel clusters. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI'16, pages 81--97, Savannah, GA, 2016. USENIX Association.
  22. Z. Hu, B. Li, Z. Qin, and R. S. M. Goh. Job scheduling without prior information in big data processing systems. In Proceedings of ICDCS, 2017.
  23. C.-C. Hung, L. Golubchik, and M. Yu. Scheduling jobs across geo-distributed datacenters. In Proceedings of the 6th ACM Symposium on Cloud Computing, SoCC'15, pages 111--124, New York, NY, USA, 2015. ACM.
  24. M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg. Quincy: Fair scheduling for distributed computing clusters. In Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles, SOSP'09, pages 261--276, 2009.
  25. K. Karanasos, S. Rao, C. Curino, C. Douglas, K. Chaliparambil, G. M. Fumarola, S. Heddaya, R. Ramakrishnan, and S. Sakalanaga. Mercury: Hybrid centralized and distributed scheduling in large shared clusters. In Proceedings of the 2015 USENIX Annual Technical Conference, USENIX ATC'15, pages 485--497, Berkeley, CA, USA, 2015. USENIX Association.
  26. Y. Kwon, M. Balazinska, B. Howe, and J. A. Rolia. SkewTune: Mitigating skew in MapReduce applications. In K. S. Candan, Y. Chen, R. T. Snodgrass, L. Gravano, and A. Fuxman, editors, SIGMOD Conference, pages 25--36. ACM, 2012.
  27. M. Nuyens and A. Wierman. The foreground-background queue: A survey. Performance Evaluation, 65(3-4):286--307, Mar. 2008.
  28. K. Ousterhout, P. Wendell, M. Zaharia, and I. Stoica. Sparrow: Distributed, low latency scheduling. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP'13, pages 69--84, New York, NY, USA, 2013. ACM.
  29. J. W. Park, A. Tumanov, A. Jiang, M. A. Kozuch, and G. R. Ganger. 3Sigma: Distribution-based cluster scheduling for runtime uncertainty. In Proceedings of the 13th European Conference on Computer Systems, EuroSys'18, page 2. ACM, 2018.
  30. J. Rasley, K. Karanasos, S. Kandula, R. Fonseca, M. Vojnovic, and S. Rao. Efficient queue management for cluster scheduling. In Proceedings of the 11th European Conference on Computer Systems, EuroSys'16, page 36. ACM, 2016.
  31. C. Reiss, A. Tumanov, G. R. Ganger, R. H. Katz, and M. A. Kozuch. Heterogeneity and dynamicity of clouds at scale: Google trace analysis. In Proceedings of the 3rd ACM Symposium on Cloud Computing, SoCC'12, pages 7:1--7:13, New York, NY, USA, 2012. ACM.
  32. L. Schrage. A proof of the optimality of the shortest remaining processing time discipline. Operations Research, 16(3):687--690, 1968.
  33. L. E. Schrage and L. W. Miller. The queue M/G/1 with the shortest remaining processing time discipline. Operations Research, 14(4):670--684, Aug. 1966.
  34. M. Schwarzkopf, A. Konwinski, M. Abd-El-Malek, and J. Wilkes. Omega: Flexible, scalable schedulers for large compute clusters. In Proceedings of the 8th European Conference on Computer Systems, EuroSys'13, pages 351--364, New York, NY, USA, 2013. ACM.
  35. A. Tumanov, T. Zhu, J. W. Park, M. A. Kozuch, M. Harchol-Balter, and G. R. Ganger. TetriSched: Global rescheduling with adaptive plan-ahead in dynamic heterogeneous clusters. In Proceedings of the 11th European Conference on Computer Systems, EuroSys'16, page 35. ACM, 2016.
  36. Y. Zhang, G. Prekas, G. M. Fumarola, M. Fontoura, Í. Goiri, and R. Bianchini. History-based harvesting of spare cycles and storage in large-scale datacenters. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI'16, pages 755--770, Berkeley, CA, USA, 2016. USENIX Association.

Published in

SoCC '18: Proceedings of the ACM Symposium on Cloud Computing
October 2018, 546 pages
ISBN: 978-1-4503-6011-1
DOI: 10.1145/3267809

Copyright © 2018 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States



Acceptance Rates

Overall acceptance rate: 169 of 722 submissions, 23%
