skip to main content
research-article

Idempotent work stealing

Published: 14 February 2009 Publication History

Abstract

Load balancing is a technique which allows efficient parallelization of irregular workloads, and a key component of many applications and parallelizing runtimes. Work-stealing is a popular technique for implementing load balancing, where each parallel thread maintains its own work set of items and occasionally steals items from the sets of other threads.
The conventional semantics of work stealing guarantee that each inserted task is eventually extracted exactly once. However, correctness of a wide class of applications allows for relaxed semantics, because either: i) the application already explicitly checks that no work is repeated or ii) the application can tolerate repeated work.
In this paper, we introduce idempotent work tealing, and present several new algorithms that exploit the relaxed semantics to deliver better performance. The semantics of the new algorithms guarantee that each inserted task is eventually extracted at least once-instead of exactly once.
On mainstream processors, algorithms for conventional work stealing require special atomic instructions or store-load memory ordering fence instructions in the owner's critical path operations. In general, these instructions are substantially slower than regular memory access instructions. By exploiting the relaxed semantics, our algorithms avoid these instructions in the owner's operations.
We evaluated our algorithms using common graph problems and micro-benchmarks and compared them to well-known conventional work stealing algorithms, the THE Cilk and Chase-Lev algorithms. We found that our best algorithm (with LIFO extraction) outperforms existing algorithms in nearly all cases, and often by significant margins.

References

[1]
Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton. Thread scheduling for multiprogrammed multiprocessors. In Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA, pages 119--129, June 1998.
[2]
David A. Bader and Guojing Cong. A fast, parallel spanning tree algorithm for symmetric multiprocessors (SMPs). Journal of Parallel and Distributed Computing, 65(9):994--1006, September 2005.
[3]
David A. Bader and Joseph JáJá. SIMPLE: a methodology for programming high performance algorithms on clusters of symmetric multiprocessors (SMPs). Journal of Parallel and Distributed Computing, 58(1):92--108, July 1999.
[4]
Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: an efficient multithreaded runtime system. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP, pages 207--216, October 1995.
[5]
Philippe Charles, Christian Grothoff, Vijay A. Saraswat, Christopher Donawa, Allan Kielstra, Kemal Ebcioglu, Christoph von Praun, and Vivek Sarkar. X10: an object-oriented approach to non-uniform cluster computing. In Proceedings of the Twentieth Annual ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA, pages 519--538, October 2005.
[6]
David Chase and Yossi Lev. Dynamic circular work-stealing deque. In Proceedings of the Seventeenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, SPAA, pages 21--28, July 2005.
[7]
Christine H. Flood, David Detlefs, Nir Shavit, and Xiaolan Zhang. Parallel garbage collection for shared memory multiprocessors. In Proceedings of the First Java Virtual Machine Research and Technology Symposium, JVM, pages 21--21, April 2001.
[8]
Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The implementation of the cilk-5 multithreaded language. In Proceedings of the ACM SIGPLAN 1998 Conference on Programming Language Design and Implementation, PLDI, pages 212--223, June 1998.
[9]
John Greiner. A comparison of parallel algorithms for connected components. In Proceedings of the Sixth Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA, pages 16--25, June 1994.
[10]
Danny Hendler, Yossi Lev, Mark Moir, and Nir Shavit. A dynamic-sized nonblocking work stealing deque. Distributed Computing, 18(3):189--207, 2006.
[11]
Danny Hendler and Nir Shavit. Non-blocking steal-half work queues. In Proceedings Twenty-First Annual ACM Symposium on Principles of Distributed Computing, PODC, pages 280--289, July 2002.
[12]
IBM System/370 Extended Architecture, Principles of Operation, 1983. Publication No. SA22-7085.
[13]
Doug Lea. The JSR-133 Cookbook for Compiler Writers. Web page.
[14]
Maged M. Michael. Practical lock-free and wait-free LL/SC/VL implementations using 64-bit CAS. In Proceedings of the Eighteenth International Conference on Distributed Computing, DISC, pages 144--158, October 2004.

Cited By

View all
  • (2024)A Framework for Automated Parallel Execution of Scientific Multi-workflow Applications in the Cloud with Work StealingEuro-Par 2024: Parallel Processing10.1007/978-3-031-69583-4_21(298-311)Online publication date: 26-Aug-2024
  • (2022)Decidability of Liveness for Concurrent Objects on the TSO Memory ModelDependable Software Engineering. Theories, Tools, and Applications10.1007/978-3-031-21213-0_10(149-165)Online publication date: 11-Dec-2022
  • (2021)Consensus-Free Ledgers When Operations of Distinct Processes are CommutativeParallel Computing Technologies10.1007/978-3-030-86359-3_27(359-370)Online publication date: 7-Sep-2021
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM SIGPLAN Notices
ACM SIGPLAN Notices  Volume 44, Issue 4
PPoPP '09
April 2009
294 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/1594835
Issue’s Table of Contents
  • cover image ACM Conferences
    PPoPP '09: Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
    February 2009
    322 pages
    ISBN:9781605583976
    DOI:10.1145/1504176
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 14 February 2009
Published in SIGPLAN Volume 44, Issue 4

Check for updates

Author Tags

  1. atomic
  2. memory barrier
  3. memory fence
  4. work stealing

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)29
  • Downloads (Last 6 weeks)6
Reflects downloads up to 08 Mar 2025

Other Metrics

Citations

Cited By

View all
  • (2024)A Framework for Automated Parallel Execution of Scientific Multi-workflow Applications in the Cloud with Work StealingEuro-Par 2024: Parallel Processing10.1007/978-3-031-69583-4_21(298-311)Online publication date: 26-Aug-2024
  • (2022)Decidability of Liveness for Concurrent Objects on the TSO Memory ModelDependable Software Engineering. Theories, Tools, and Applications10.1007/978-3-031-21213-0_10(149-165)Online publication date: 11-Dec-2022
  • (2021)Consensus-Free Ledgers When Operations of Distinct Processes are CommutativeParallel Computing Technologies10.1007/978-3-030-86359-3_27(359-370)Online publication date: 7-Sep-2021
  • (2019)A mechanized refinement proof of the Chase---Lev deque using a proof systemComputing10.1007/s00607-018-0635-4101:1(59-74)Online publication date: 1-Jan-2019
  • (2018)Balanced double queues for GC work-stealing on weak memory modelsACM SIGPLAN Notices10.1145/3299706.321057053:5(109-119)Online publication date: 18-Jun-2018
  • (2018)Balanced double queues for GC work-stealing on weak memory modelsProceedings of the 2018 ACM SIGPLAN International Symposium on Memory Management10.1145/3210563.3210570(109-119)Online publication date: 18-Jun-2018
  • (2018)A Solution of Concurrent Queue on PSTM2018 26th Telecommunications Forum (TELFOR)10.1109/TELFOR.2018.8611903(1-4)Online publication date: Nov-2018
  • (2018)Scheduling Parallel Computations by Work StealingInternational Journal of Parallel Programming10.1007/s10766-016-0484-846:2(173-197)Online publication date: 1-Apr-2018
  • (2017)Don’t Sit on the FenceACM Transactions on Programming Languages and Systems10.1145/299459339:2(1-38)Online publication date: 29-May-2017
  • (2016)AEQUITASProceedings of the 2016 International Conference on Supercomputing10.1145/2925426.2926260(1-12)Online publication date: 1-Jun-2016
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media