Automating Job Monitoring System for an Ecosystem of High Performance Computing

Authors:
Kajornsak Piyoungkorn, Natsuda Kasisopha, Phithak Thaenkaew, Chalee Vorakulpipat

National Electronics and Computer Technology Center, Khlong Nueng, Khlong Luang, Phathum Thani, Thailand

MEDES '17: Proceedings of the 9th International Conference on Management of Digital EcoSystems, November 2017, Pages 281–286
https://doi.org/10.1145/3167020.3167062

Published: 07 November 2017

ABSTRACT

Many countries have founded national high performance computing centers that provide computational resources to their scientists upon request. These resources are often used inefficiently because job requests do not reflect actual use, which leads to unnecessary resource consumption. In this paper, we present a method to monitor and manage High Performance Computing (HPC) resources more efficiently. HPC resources are usually managed by a Job Management System (JMS) such as the Portable Batch System (PBS), which handles job scheduling and resource allocation. However, job requests are often inefficient. For instance, a job may request four processors per node for two hours while its actual usage occupies four processors per node for only one hour. The HPC resources therefore lose an hour of productivity, and the queues for job execution grow longer. The automated job monitoring system proposed in this paper scans all the jobs on every HPC node and compares the conditions of each job request with preset criteria. If the conditions meet the criteria, the inefficient jobs are forcibly cancelled from the HPC queue. The results show that more HPC resources become available for executing other jobs in the queue, which saves resources in the HPC environment, stabilizes the HPC hardware, and promotes an HPC infrastructure ecosystem.
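
The comparison described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Job` fields, the `find_inefficient_jobs` helper, and the 0.6 utilization threshold are all hypothetical, since the abstract does not state the preset criteria. The sample data reproduces the abstract's example of four processors per node requested for two hours but busy for one.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical snapshot of one job; the field names are illustrative."""
    job_id: str
    procs_per_node: int      # processors requested per node
    requested_hours: float   # walltime requested by the user
    used_hours: float        # time the processors were actually busy

    def requested_proc_hours(self) -> float:
        return self.procs_per_node * self.requested_hours

    def used_proc_hours(self) -> float:
        return self.procs_per_node * self.used_hours

    def utilization(self) -> float:
        requested = self.requested_proc_hours()
        # Treat an empty request as fully utilized to avoid dividing by zero.
        return self.used_proc_hours() / requested if requested else 1.0


def find_inefficient_jobs(jobs, min_utilization=0.6):
    """Return jobs whose utilization is below the assumed threshold."""
    return [job for job in jobs if job.utilization() < min_utilization]


# Sample data: job-a mirrors the example in the abstract.
jobs = [
    Job("job-a", procs_per_node=4, requested_hours=2.0, used_hours=1.0),
    Job("job-b", procs_per_node=4, requested_hours=2.0, used_hours=1.9),
]

flagged = find_inefficient_jobs(jobs)
for job in flagged:
    idle = job.requested_proc_hours() - job.used_proc_hours()
    print(f"{job.job_id}: {idle:.1f} processor-hours idle per node, "
          "candidate for cancellation")
# prints: job-a: 4.0 processor-hours idle per node, candidate for cancellation
```

In a real deployment the job snapshots would come from the JMS rather than a hard-coded list, and flagged jobs would be cancelled through the scheduler. Both steps are left out here because the abstract does not describe how they are carried out.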


Published in

MEDES '17: Proceedings of the 9th International Conference on Management of Digital EcoSystems
November 2017
299 pages
ISBN:9781450348959
DOI:10.1145/3167020

Copyright © 2017 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Ineffective HPC Job detection
Job Monitoring System
Resource-saving High Performance Computing Management System
Qualifiers
- research-article
- Research
- Refereed limited
Acceptance Rates
MEDES '17 Paper Acceptance Rate: 41 of 65 submissions, 63%
Overall Acceptance Rate: 267 of 682 submissions, 39%