research-article

Elastic job bundling: an adaptive resource request strategy for large-scale parallel applications

Authors:
Feng Liu

University of Minnesota, Minneapolis, MN

University of Minnesota, Minneapolis, MN
View Profile

,
Jon B. Weissman

University of Minnesota, Minneapolis, MN

University of Minnesota, Minneapolis, MN
View Profile

SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and AnalysisNovember 2015Article No.: 33Pages 1–12https://doi.org/10.1145/2807591.2807610

Published:15 November 2015Publication History

SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

Pages 1–12

ABSTRACT

In today's batch queue HPC cluster systems, the user submits a job requesting a fixed number of processors. The system will not start the job until all of the requested resources become available simultaneously. When cluster workload is high, large sized jobs will experience long waiting time due to this policy. In this paper, we propose a new approach that dynamically decomposes a large job into smaller ones to reduce waiting time, and lets the application expand across multiple subjobs while continuously achieving progress. This approach has three benefits: (i) application turnaround time is reduced, (ii) system fragmentation is diminished, and (iii) fairness is promoted. Our approach does not depend on job queue time prediction but exploits available backfill opportunities. Simulation results have shown that our approach can reduce application mean turnaround time by up to 48%.

References

https://bitbucket.org/francis_liu/pyss.Google Scholar
Futuregrid. futuregrid.org.Google Scholar
Parallel workloads archive. http://www.cs.huji.ac.il/labs/parallel/workload/.Google Scholar
pyss - the python scheduler simulator. https://code.google.com/p/pyss/.Google Scholar
J. Ansel, K. Aryay, and G. Coopermany. Dmtcp: Transparent checkpointing for cluster computations and the desktop. In Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on, pages 1--12. IEEE, 2009. Google ScholarDigital Library
K. Asanovic, R. Bodik, B. C. Catanzaro, J. J. Gebis, P. Husbands, K. Keutzer, D. A. Patterson, W. L. Plishker, J. Shalf, S. W. Williams, et al. The landscape of parallel computing research: A view from berkeley. Technical report, Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, 2006.Google Scholar
D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, et al. The nas parallel benchmarks. International Journal of High Performance Computing Applications, 5(3):63--73, 1991. Google ScholarDigital Library
W. Cirne and F. Berman. A model for moldable supercomputer jobs. In Parallel and Distributed Processing Symposium., Proceedings 15th International, pages 8--pp. IEEE, 2001. Google ScholarDigital Library
W. Cirne and F. Berman. Using moldability to improve the performance of supercomputer jobs. Journal of Parallel and Distributed Computing, 62(10):1571--1601, 2002. Google ScholarDigital Library
W. Cirne and F. Berman. When the herd is smart: aggregate behavior in the selection of job request. Parallel and Distributed Systems, IEEE Transactions on, 14(2):181--192, 2003. Google ScholarDigital Library
A. B. Downey. Predicting queue times on space-sharing parallel computers. In Parallel Processing Symposium, 1997. Proceedings., 11th International, pages 209--218. IEEE, 1997. Google ScholarDigital Library
A. B. Downey. Using queue time predictions for processor allocation. In Job Scheduling Strategies for Parallel Processing, pages 35--57. Springer, 1997. Google ScholarDigital Library
D. G. Feitelson. Metric and workload effects on computer systems evaluation. Computer, 36(9):18--25, 2003. Google ScholarDigital Library
S. Jha, M. Cole, D. S. Katz, M. Parashar, O. Rana, and J. Weissman. Distributed computing practice for large-scale science and engineering applications. Concurrency and Computation: Practice and Experience, 25(11):1559--1585, 2013.Google ScholarCross Ref
D. LaSalle and G. Karypis. Mpi for big data: New tricks for an old dog. Parallel Computing, 40(10):754--767, 2014.Google ScholarDigital Library
H. Li, D. Groep, J. Templon, and L. Wolters. Predicting job start times on clusters. In Cluster Computing and the Grid, 2004. CCGrid 2004. IEEE International Symposium on, pages 301--308. IEEE, 2004. Google ScholarDigital Library
D. J. Lilja. Measuring computer performance: a practitioner's guide. Cambridge University Press, 2000. Google ScholarCross Ref
F. Liu and J. Weissman. Elastic job bundling: An adaptive resource request strategy for large-scale parallel applications. Technical report, TR15-006, Department of Computer Science and Engineering, University of Minnesota, 2015.Google ScholarDigital Library
A. Luckow, M. Santcroos, O. Weidner, A. Merzky, P. Mantha, and S. Jha. P*: A Model of Pilot-Abstractions. In 8th IEEE International Conference on e-Science 2012, 2012. Google ScholarDigital Library
A. W. Mu'alem and D. G. Feitelson. Utilization, predictability, workloads, and user runtime estimates in scheduling the ibm sp2 with backfilling. Parallel and Distributed Systems, IEEE Transactions on, 12(6):529--543, 2001. Google ScholarDigital Library
D. Nurmi, J. Brevik, and R. Wolski. Qbets: queue bounds estimation from time series. In Job Scheduling Strategies for Parallel Processing, pages 76--101. Springer, 2008. Google ScholarDigital Library
K. Ousterhout, A. Panda, J. Rosen, S. Venkataraman, R. Xin, S. Ratnasamy, S. Shenker, and I. Stoica. The case for tiny tasks in compute clusters.Google Scholar
W. Smith, V. Taylor, and I. Foster. Using run-time predictions to estimate queue wait times and improve scheduler performance. In Job Scheduling Strategies for Parallel Processing, pages 202--219. Springer, 1999. Google ScholarDigital Library
G. Staples. Torque resource manager. In Proceedings of the 2006 ACM/IEEE conference on Supercomputing, page 8. ACM, 2006. Google ScholarDigital Library
R. Sudarsan and C. J. Ribbens. Reshape: A framework for dynamic resizing and scheduling of homogeneous applications in a parallel environment. In Parallel Processing, 2007. ICPP 2007. International Conference on, page 44. IEEE, 2007. Google ScholarDigital Library
R. Sudarsan and C. J. Ribbens. Scheduling resizable parallel applications. In Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on, pages 1--10. IEEE, 2009. Google ScholarDigital Library
D. Tsafrir, Y. Etsion, and D. G. Feitelson. Backfilling using system-generated predictions rather than user runtime estimates. Parallel and Distributed Systems, IEEE Transactions on, 18(6):789--803, 2007. Google ScholarDigital Library
G. Utrera, S. Tabik, J. Corbalan, and J. Labarta. A job scheduling approach for multi-core clusters based on virtual malleability. In Euro-Par 2012 Parallel Processing, pages 191--203. Springer, 2012. Google ScholarDigital Library
J. B. Weissman, L. R. Abburi, and D. England. Integrated scheduling: the best of both worlds. Journal of Parallel and Distributed Computing, 63(6):649--668, 2003. Google ScholarDigital Library
A. Wierman. Revisiting the performance of large jobs in the M/GI/1 queue. In Proceedings of the Forty-Fifth Annual Allerton Conference On Communication, Control, and Computing, pages 607--614, 2007.Google Scholar
R. Wolski. Experiences with predicting resource performance on-line in computational grid settings. ACM SIGMETRICS Performance Evaluation Review, 30(4):41--49, 2003. Google ScholarDigital Library
A. B. Yoo, M. A. Jette, and M. Grondona. Slurm: Simple linux utility for resource management. In Job Scheduling Strategies for Parallel Processing, pages 44--60. Springer, 2003.Google ScholarCross Ref

Index Terms

Elastic job bundling: an adaptive resource request strategy for large-scale parallel applications

Recommendations

Improvements in Parallel Job Scheduling Using Gang Service
ISPAN '99: Proceedings of the 1999 International Symposium on Parallel Architectures, Algorithms and Networks

Gang scheduling has been widely used as a practical solution to the dynamic parallel job scheduling problem. Parallel threads of a single job are scheduled for simultaneous execution on a parallel computer even if the job does not fully utilize all ...
Read More
A Multi-constraint Preemption Algorithm for Parallel Job Scheduling
PDCAT '11: Proceedings of the 2011 12th International Conference on Parallel and Distributed Computing, Applications and Technologies

Using effective scheduling strategies to improve turnaround time, slowdown, and utilization is an important consideration in large supercomputing environments. Since such machines have traditionally used non-preemption strategies to accommodate multiple ...
Read More
Selective preemption strategies for parallel job scheduling

Although theoretical results have been established regarding the utility of preemptive scheduling in reducing average job turnaround time, job suspension/restart is not much used in practice at supercomputer centres for parallel job scheduling. A number ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2015
985 pages
ISBN:9781450337236
DOI:10.1145/2807591
General Chair:
Jackie Kern
University of Illinois at Urbana-Champaign, Urbana, Illinois
,
Program Chair:
Jeffrey S. Vetter
Oak Ridge National Laboratory and Georgia Institute of Technology, Oak Ridge, Tennessee
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 15 November 2015
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
HPC
elasticity
parallel job scheduling
Qualifiers
- research-article
Conference

Acceptance Rates
SC '15 Paper Acceptance Rate79of358submissions,22%Overall Acceptance Rate1,516of6,373submissions,24%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 20
  Total Citations
  View Citations
- 299
  Total Downloads
- Downloads (Last 12 months)6
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Elastic job bundling: an adaptive resource request strategy for large-scale parallel applications

SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

ABSTRACT

References

Cited By

Index Terms

Recommendations

Improvements in Parallel Job Scheduling Using Gang Service

A Multi-constraint Preemption Algorithm for Parallel Job Scheduling

Selective preemption strategies for parallel job scheduling

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Elastic job bundling: an adaptive resource request strategy for large-scale parallel applications

SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis

ABSTRACT

References

Cited By

Index Terms

Recommendations

Improvements in Parallel Job Scheduling Using Gang Service

A Multi-constraint Preemption Algorithm for Parallel Job Scheduling

Selective preemption strategies for parallel job scheduling

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media