ABSTRACT
A Slurm simulator was used to study the potential benefits of multiple Slurm controllers and node sharing on the TACC Stampede 2 system. Splitting a large cluster into smaller sub-clusters, each with its own Slurm controller, can improve scheduling performance and responsiveness: each controller handles a smaller workload, which increases the efficiency of the backfill scheduler. The disadvantages are additional hardware, more maintenance, and the inability to run jobs across sub-clusters. Node sharing can increase system throughput by allowing several sub-node jobs to execute on the same node; however, it is more computationally demanding for the scheduler and might not be advantageous on larger systems. The Slurm simulator allows these configurations to be evaluated before deployment, providing an estimate of the benefits to be expected from them. In this work, multiple Slurm controllers and node sharing were tested on a model of the TACC Stampede 2 system, which consists of two distinct node types: 4,200 Intel Xeon Phi Knights Landing (KNL) nodes and 1,736 Intel Xeon Skylake-X (SLX) nodes. For this system, using separate controllers for the KNL and SLX nodes, with node sharing allowed on the SLX nodes, reduced waiting times for jobs executed on the SLX nodes by 40%. This improvement can be attributed to better backfill scheduler performance: the backfill scheduler scheduled 30% more SLX jobs, reduced by 30% the fraction of scheduling cycles that hit the time limit, and nearly doubled the number of job scheduling attempts.
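The configuration studied above, a separate controller per node type with node sharing on the SLX sub-cluster, could be sketched in slurm.conf roughly as follows. This is a minimal illustrative sketch, not the study's actual configuration: host names, node names, and all tuning values are hypothetical, while `select/cons_res` is the standard Slurm mechanism for sharing nodes among sub-node jobs.

```
# Hypothetical slurm.conf fragment for one of two sub-clusters: the
# controller managing only the SLX nodes. The KNL nodes would be listed
# in a second slurm.conf served by their own slurmctld instance.

ClusterName=stampede2-slx
SlurmctldHost=slx-ctld            # hypothetical controller host

# Node sharing: treat cores and memory as consumable resources so that
# several sub-node jobs can be packed onto the same node.
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

# Backfill scheduler; parameter values are illustrative only.
SchedulerType=sched/backfill
SchedulerParameters=bf_interval=30,bf_max_job_test=1000,bf_window=2880

# SLX nodes only (hypothetical names and sizes).
NodeName=slx-[0001-1736] Sockets=2 CoresPerSocket=24 RealMemory=192000
PartitionName=skx-normal Nodes=ALL Default=YES MaxTime=48:00:00 State=UP
```

Because each controller sees only its own sub-cluster's nodes and job queue, the backfill scheduler has fewer jobs and nodes to consider per cycle, which is the mechanism behind the reported improvement.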
Slurm Simulator: Improving Slurm Scheduler Performance on Large HPC Systems by Utilization of Multiple Controllers and Node Sharing