ABSTRACT
Many countries have founded national high performance computing centers that provide computational resources to their scientists on request. These resources are often used inefficiently because job requests do not match actual usage, which leads to unnecessary resource consumption. In this paper, we present a method to monitor and manage High Performance Computing (HPC) resources more efficiently. HPC resources are usually managed by a Portable Batch System (PBS) acting as the Job Management System (JMS) for job scheduling and resource allocation. However, HPC resources are often tied up by inefficient job requests. For instance, a job may request four processors per node for two hours, while the actual usage is four processors per node for one hour. Hence, the HPC resources lose an hour of productivity, and the queues for job execution grow longer. The automated job monitoring system proposed in this paper scans all jobs on every HPC node and compares the conditions of each job request with preset criteria. If the conditions meet the criteria, the inefficient job is forcibly cancelled from the HPC queue. The results show that more HPC resources become available for executing other jobs in the queue, which saves resources in the HPC environment, stabilizes the HPC hardware, and supports the HPC infrastructure ecosystem.
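The comparison of job requests against preset criteria can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Job` record, the `utilization` measure, and the 0.6 threshold are hypothetical, chosen only to reproduce the abstract's example of four processors per node requested for two hours but used for one.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical job record; field names are assumptions, not PBS attributes."""
    job_id: str
    procs_per_node: int      # processors requested per node
    requested_hours: float   # wall time requested
    used_hours: float        # wall time the job actually kept the processors busy


def utilization(job: Job) -> float:
    """Fraction of the requested processor-hours that the job really used."""
    requested = job.procs_per_node * job.requested_hours
    used = job.procs_per_node * job.used_hours
    return used / requested if requested > 0 else 0.0


def find_inefficient(jobs, min_utilization=0.6):
    """Return the IDs of jobs whose utilization falls below the assumed threshold."""
    return [j.job_id for j in jobs if utilization(j) < min_utilization]


jobs = [
    Job("1001", procs_per_node=4, requested_hours=2.0, used_hours=1.0),  # abstract's example
    Job("1002", procs_per_node=4, requested_hours=2.0, used_hours=1.9),  # well-matched request
]

flagged = find_inefficient(jobs)
print(flagged)  # candidates for cancellation from the queue
```

In the abstract's example the job uses 4 of the 8 requested processor-hours per node, a utilization of 0.5, so it would be flagged under this assumed threshold. In a PBS deployment the job data would presumably come from a command such as `qstat -f`, and flagged jobs would be cancelled with `qdel`; the actual criteria used in the paper are not stated in the abstract.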
- Automating Job Monitoring System for an Ecosystem of High Performance Computing