ABSTRACT
The vast majority of data center schedulers use task runtime estimates to improve the quality of their scheduling decisions. Knowledge about runtimes allows the schedulers, among other things, to achieve better load balance and to avoid head-of-line blocking. Obtaining accurate runtime estimates is, however, far from trivial, and erroneous estimates lead to sub-optimal scheduling decisions. Techniques to mitigate the effect of inaccurate estimates have shown some success, but the fundamental problem remains.
This paper presents Kairos, a novel data center scheduler that assumes no prior information on task runtimes. Kairos introduces a distributed approximation of the Least Attained Service (LAS) scheduling policy. Kairos consists of a centralized scheduler and per-node schedulers. The per-node schedulers implement LAS for tasks on their node, using preemption as necessary to avoid head-of-line blocking. The centralized scheduler distributes tasks among nodes in a manner that balances the load and imposes on each node a workload in which LAS provides favorable performance.
We have implemented Kairos in YARN. We compare its performance against the YARN FIFO scheduler and Big-C, an open-source state-of-the-art YARN-based scheduler that also uses preemption. Compared to YARN FIFO, Kairos reduces the median job completion time by 73% and the 99th percentile by 30%. Compared to Big-C, the improvements are 37% for the median and 57% for the 99th percentile. We evaluate Kairos at scale by implementing it in the Eagle simulator and comparing its performance against Eagle. Kairos improves the 99th percentile of short job completion times by up to 55% for the Google trace and 85% for the Yahoo trace.
- O. Alipourfard, H. H. Liu, J. Chen, S. Venkataraman, M. Yu, and M. Zhang. Cherrypick: Adaptively unearthing the best cloud configurations for big data analytics. In Proceedings of the 14th USENIX Conference on Networked Systems Design and Implementation, NSDI'17, pages 469--482, Berkeley, CA, USA, 2017. USENIX Association. Google ScholarDigital Library
- G. Ananthanarayanan, A. Ghodsi, S. Shenker, and I. Stoica. Effective straggler mitigation: Attack of the clones. In Proceedings of the 10th USENIX Conference on Networked Systems Design and Implementation, NSDI'13, pages 185--198, Berkeley, CA, USA, 2013. USENIX Association. Google ScholarDigital Library
- N. Avrahami and Y. Azar. Minimizing total flow time and total completion time with immediate dispatching. In Proceedings of the 15th Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA'03, pages 11--18, New York, NY, USA, 2003. ACM. Google ScholarDigital Library
- E. Boutin, J. Ekanayake, W. Lin, B. Shi, J. Zhou, Z. Qian, M. Wu, and L. Zhou. Apollo: Scalable and coordinated scheduling for cloud-scale computing. In 11th USENIX Symposium on Operating Systems Design and Implementation, OSDI'14, pages 285--300, Broomfield, CO, Oct. 2014. USENIX Association. Google ScholarDigital Library
- W. Chen, J. Rao, and X. Zhou. Preemptive, low latency datacenter scheduling via lightweight virtualization. In Proceedings of the 2017 USENIX Conference on Usenix Annual Technical Conference, USENIX ATC'17, pages 251--263, Santa Clara, CA, 2017. Google ScholarDigital Library
- Y. Chen, S. Alspaugh, and R. Katz. Interactive analytical processing in big data systems: A cross-industry study of mapreduce workloads. Proceedings of the VLDB Endowment, 5(12):1802--1813, Aug. 2012. Google ScholarDigital Library
- Y. Chen, A. Ganapathi, R. Griffith, and R. Katz. The case for evaluating mapreduce performance using workload suites. In Proceedings of the 2011 IEEE 19th Annual International Symposium on Modelling, Analysis, and Simulation of Computer and Telecommunication Systems, MASCOTS'11, pages 390--399, Washington, DC, USA, 2011. IEEE Computer Society. Google ScholarDigital Library
- E. Coppa and I. Finocchi. On data skewness, stragglers, and mapreduce progress indicators. In Proceedings of the 6th ACM Symposium on Cloud Computing, SoCC'15, pages 139--152, New York, NY, USA, 2015. ACM. Google ScholarDigital Library
- C. Curino, D. E. Difallah, C. Douglas, S. Krishnan, R. Ramakrishnan, and S. Rao. Reservation-based scheduling: If you're late don't blame us! In Proceedings of the ACM Symposium on Cloud Computing, SOCC'14, pages 2:1--2:14, New York, NY, USA, 2014. ACM. Google ScholarDigital Library
- P. Delgado, D. Didona, F. Dinu, and W. Zwaenepoel. Job-aware scheduling in eagle: Divide and stick to your probes. In Proceedings of the Seventh ACM Symposium on Cloud Computing, number EPFL-CONF-221125, 2016. Google ScholarDigital Library
- P. Delgado, F. Dinu, A.-M. Kermarrec, and W. Zwaenepoel. Hawk: Hybrid datacenter scheduling. In Proceedings of the 2015 USENIX Conference on Usenix Annual Technical Conference, USENIX ATC'15, pages 499--510. USENIX Association, July 2015. Google ScholarDigital Library
- C. Delimitrou and C. Kozyrakis. Quasar: Resource-efficient and qos-aware cluster management. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS'14, pages 127--144, New York, NY, USA, 2014. ACM. Google ScholarDigital Library
- D. G. Down and R. Wu. Multi-layered round robin routing for parallel servers. Queueing Systems, 53(4):177--188, Aug. 2006. Google ScholarDigital Library
- A. D. Ferguson, P. Bodik, S. Kandula, E. Boutin, and R. Fonseca. Jockey: Guaranteed job latency in data parallel clusters. In Proceedings of the 7th ACM European Conference on Computer Systems, EuroSys'12, pages 99--112, New York, NY, USA, 2012. ACM. Google ScholarDigital Library
- P. Garefalakis, K. Karanasos, P. Pietzuch, A. Suresh, and S. Rao. Medea: Scheduling of long running applications in shared production clusters. In Proceedings of the 13th European Conference on Computer Systems, EuroSys'18, pages 4:1--4:13, 2018. Google ScholarDigital Library
- B. Ghit and D. H. J. Epema. Tyrex: Size-based resource allocation in mapreduce frameworks. In IEEE/ACM 16th International Symposium on Cluster, Cloud and Grid Computing, CCGrid 2016, Cartagena, Colombia, May 16-19, 2016, pages 11--20, 2016.Google ScholarDigital Library
- A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica. Dominant resource fairness: Fair allocation of multiple resource types. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation, NSDI'11, pages 323--336, Berkeley, CA, USA, 2011. USENIX Association. Google ScholarDigital Library
- I. Gog, M. Schwarzkopf, A. Gleave, R. M. N. Watson, and S. Hand. Firmament: Fast, centralized cluster scheduling at scale. In Proc. of OSDI, 2016. Google ScholarDigital Library
- R. Grandl, G. Ananthanarayanan, S. Kandula, S. Rao, and A. Akella. Multi-resource packing for cluster schedulers. ACM SIGCOMM Computer Communication Review, 44(4):455--466, Aug. 2014. Google ScholarDigital Library
- R. Grandl, M. Chowdhury, A. Akella, and G. Ananthanarayanan. Altruistic scheduling in multi-resource clusters. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI'16, pages 65--80, Berkeley, CA, USA, 2016. USENIX Association. Google ScholarDigital Library
- R. Grandl, S. Kandula, S. Rao, A. Akella, and J. Kulkarni. GRAPHENE: Packing and dependency-aware scheduling for data-parallel clusters. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI'16, pages 81--97, Savannah, GA, 2016. USENIX Association. Google ScholarDigital Library
- Z. Hu, B. Li, Z. Qin, and R. S. M. Goh. Job scheduling without prior information in big data processing systems. In Proceedings of ICDCS, 2017.Google ScholarCross Ref
- C.-C. Hung, L. Golubchik, and M. Yu. Scheduling jobs across geo-distributed datacenters. In Proceedings of the 6th Symposium on Cloud Computing, SoCC'15, pages 111--124, New York, NY, USA, 2015. ACM. Google ScholarDigital Library
- M. Isard, V. Prabhakaran, J. Currey, U. Wieder, K. Talwar, and A. Goldberg. Quincy: Fair scheduling for distributed computing clusters. In Proceedings of the ACM SIGOPS 22Nd Symposium on Operating Systems Principles, SOSP'09, pages 261--276, 2009. Google ScholarDigital Library
- K. Karanasos, S. Rao, C. Curino, C. Douglas, K. Chaliparambil, G. M. Fumarola, S. Heddaya, R. Ramakrishnan, and S. Sakalanaga. Mercury: Hybrid centralized and distributed scheduling in large shared clusters. In Proceedings of the 2015 USENIX Conference on Usenix Annual Technical Conference, USENIX ATC'15, pages 485--497, Berkeley, CA, USA, 2015. USENIX Association. Google ScholarDigital Library
- Y. Kwon, M. Balazinska, B. Howe, and J. A. Rolia. Skewtune: mitigating skew in mapreduce applications. In K. S. Candan, Y. C. 0001, R. T. Snodgrass, L. Gravano, and A. Fuxman, editors, SIGMOD Conference, pages 25--36. ACM, 2012. Google ScholarDigital Library
- M. Nuyens and A. Wierman. The foreground-background queue: A survey. Performance evaluation, 65(3-4):286--307, Mar. 2008. Google ScholarDigital Library
- K. Ousterhout, P. Wendell, M. Zaharia, and I. Stoica. Sparrow: Distributed, low latency scheduling. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, SOSP'13, pages 69--84, New York, NY, USA, 2013. ACM. Google ScholarDigital Library
- J. W. Park, A. Tumanov, A. Jiang, M. A. Kozuch, and G. R. Ganger. 3sigma: distribution-based cluster scheduling for runtime uncertainty. In Proceedings of the 13th European Conference on Computer Systems, EuroSys'18, page 2. ACM, 2018. Google ScholarDigital Library
- J. Rasley, K. Karanasos, S. Kandula, R. Fonseca, M. Vojnovic, and S. Rao. Efficient queue management for cluster scheduling. In Proceedings of the 11th European Conference on Computer Systems, EuroSys'16, page 36. ACM, 2016. Google ScholarDigital Library
- C. Reiss, A. Tumanov, G. R. Ganger, R. H. Katz, and M. A. Kozuch. Heterogeneity and dynamicity of clouds at scale: Google trace analysis. In Proceedings of the 3rd ACM Symposium on Cloud Computing, SoCC'12, pages 7:1--7:13, New York, NY, USA, 2012. ACM. Google ScholarDigital Library
- L. Schrage. A proof of the optimality of the shortest remaining processing time discipline. Operations Research, 16(3):687--690, 1968.Google ScholarDigital Library
- L. E. Schrage and L. W. Miller. The queue m/g/1 with the shortest remaining processing time discipline. Operations Research, 14(4):670--684, Aug. 1966. Google ScholarDigital Library
- M. Schwarzkopf, A. Konwinski, M. Abd-El-Malek, and J. Wilkes. Omega: Flexible, scalable schedulers for large compute clusters. In Proceedings of the 8th European Conference on Computer Systems, EuroSys'13, pages 351--364, New York, NY, USA, 2013. ACM. Google ScholarDigital Library
- A. Tumanov, T. Zhu, J. W. Park, M. A. Kozuch, M. Harchol-Balter, and G. R. Ganger. Tetrisched: global rescheduling with adaptive plan-ahead in dynamic heterogeneous clusters. In Proceedings of the 11th European Conference on Computer Systems, EuroSys'16, page 35. ACM, 2016. Google ScholarDigital Library
- Y. Zhang, G. Prekas, G. M. Fumarola, M. Fontoura, I. n. Goiri, and R. Bianchini. History-based harvesting of spare cycles and storage in large-scale datacenters. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation, OSDI'16, pages 755--770, Berkeley, CA, USA, 2016. USENIX Association. Google ScholarDigital Library
Index Terms
- Kairos: Preemptive Data Center Scheduling Without Runtime Estimates
Recommendations
Job-aware Scheduling in Eagle: Divide and Stick to Your Probes
SoCC '16: Proceedings of the Seventh ACM Symposium on Cloud ComputingWe present Eagle, a new hybrid data center scheduler for data-parallel programs. Eagle dynamically divides the nodes of the data center in partitions for the execution of long and short jobs, thereby avoiding head-of-line blocking. Furthermore, it ...
Forget the Deadline: Scheduling Interactive Applications in Data Centers
CLOUD '15: Proceedings of the 2015 IEEE 8th International Conference on Cloud ComputingMany interactive applications running in data centers such as web search, social networks, online gaming, and financial services are delay-sensitive, and often have a deadline. These deadlines vary across users and applications, which makes the job ...
Scheduling of deteriorating jobs with release dates to minimize the maximum lateness
In this paper, we consider the problem of scheduling n deteriorating jobs with release dates on a single (batching) machine. Each job's processing time is a simple linear function of its starting time. The objective is to minimize the maximum lateness. ...
Comments