DOI: 10.1145/2807591.2807610

Elastic job bundling: an adaptive resource request strategy for large-scale parallel applications

Published: 15 November 2015

ABSTRACT

In today's batch-queue HPC cluster systems, the user submits a job requesting a fixed number of processors. The system will not start the job until all of the requested resources become available simultaneously. When cluster workload is high, large jobs experience long waiting times under this policy. In this paper, we propose a new approach that dynamically decomposes a large job into smaller ones to reduce waiting time and lets the application expand across multiple subjobs while continuously making progress. This approach has three benefits: (i) application turnaround time is reduced, (ii) system fragmentation is diminished, and (iii) fairness is promoted. Our approach does not depend on job queue time prediction but instead exploits available backfill opportunities. Simulation results show that our approach can reduce application mean turnaround time by up to 48%.

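The abstract describes the idea at a high level: rather than waiting for one large allocation, the job request is split into smaller subjob requests that can start in available backfill holes, and the application expands across subjobs as they begin. The Python sketch below is only a rough illustration of that decomposition step under assumed inputs; the function bundle_request, the (free_procs, max_runtime) window representation, and the min_subjob threshold are hypothetical names introduced for the example and are not taken from the paper's algorithm or from the pyss simulator.

    # Illustrative sketch (assumed names, not the authors' implementation):
    # split one large processor request into subjob requests sized to fit
    # backfill windows that are assumed to be queryable from the scheduler.
    def bundle_request(total_procs, backfill_windows, min_subjob=16):
        """Return a list of subjob requests covering total_procs processors.

        backfill_windows: list of (free_procs, max_runtime_seconds) tuples
        describing currently idle slots; this representation is an assumption
        made for the example, not the scheduler's real interface.
        """
        subjobs = []
        remaining = total_procs
        # Fill the largest backfill holes first so the application can start
        # making progress on as many processors as possible right away.
        for free_procs, max_runtime in sorted(backfill_windows, reverse=True):
            if remaining <= 0:
                break
            size = min(free_procs, remaining)
            if size < min_subjob:
                continue  # a hole this small is not worth a separate subjob
            subjobs.append({"procs": size, "walltime": max_runtime})
            remaining -= size
        if remaining > 0:
            # Whatever cannot be backfilled now is requested as an ordinary
            # queued subjob and joins the bundle when it eventually starts.
            subjobs.append({"procs": remaining, "walltime": None})
        return subjobs

    if __name__ == "__main__":
        # Example: a 1024-processor request with two backfill holes free now.
        print(bundle_request(1024, [(256, 3600), (128, 1800)]))

In this toy example the 1024-processor request becomes a 256-processor and a 128-processor backfill subjob plus a 640-processor queued remainder; the paper's actual policies for sizing, merging, and expanding subjobs are described in the full text.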

Published in

SC '15: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2015, 985 pages
ISBN: 9781450337236
DOI: 10.1145/2807591
General Chair: Jackie Kern
Program Chair: Jeffrey S. Vetter
Copyright © 2015 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Publication History: Published 15 November 2015

Qualifiers: research-article

Acceptance Rates: SC '15 paper acceptance rate: 79 of 358 submissions (22%). Overall acceptance rate: 1,516 of 6,373 submissions (24%).
