research-article

Analysis of dynamic scheduling strategies for matrix multiplication on heterogeneous platforms

Authors:

Olivier Beaumont,

Loris MarchalAuthors Info & Claims

HPDC '14: Proceedings of the 23rd international symposium on High-performance parallel and distributed computing

Pages 141 - 152

https://doi.org/10.1145/2600212.2600223

Published: 23 June 2014 Publication History

Abstract

The tremendous increase in the size and heterogeneity of supercomputers makes it very difficult to predict the performance of a scheduling algorithm. Therefore, dynamic solutions, where scheduling decisions are made at runtime have overpassed static allocation strategies. The simplicity and efficiency of dynamic schedulers such as Hadoop are a key of the success of the MapReduce framework. Dynamic schedulers such as StarPU, PaRSEC or StarSs are also developed for more constrained computations, e.g. task graphs coming from linear algebra. To make their decisions, these runtime systems make use of some static information, such as the distance of tasks to the critical path or the affinity between tasks and computing resources (CPU, GPU, …) and of dynamic information, such as where input data are actually located. In this paper, we concentrate on two elementary linear algebra kernels, namely the outer product and the matrix multiplication. For each problem, we propose several dynamic strategies that can be used at runtime and we provide an analytic study of their theoretical performance. We prove that the theoretical analysis provides very good estimate of the amount of communications induced by a dynamic strategy and can be used in order to efficiently determine thresholds used in dynamic scheduler, thus enabling to choose among them for a given problem and architecture.

References

[1]

C. Augonnet, S. Thibault, R. Namyst, and P.-A. Wacrenier. StarPU: a unified platform for task scheduling on heterogeneous multicore architectures. Concurrency and Computation: Practice and Experience, 23(2):187--198, 2011.

Digital Library

[2]

O. Beaumont, V. Boudet, F. Rastello, and Y. Robert. Partitioning a square into rectangles: NP-completeness and approximation algorithms. Algorithmica, 34(3):217--239, 2002.

Digital Library

[3]

O. Beaumont, H. Larcheveque, and L. Marchal. Non linear divisible loads: There is no free lunch. In International Parallel and Distributed Processing Symposium, 2012, pages 863--873. IEEE, 2012.

Digital Library

[4]

M. Benaim and J.-Y. Le Boudec. A class of mean field interaction models for computer and communication systems. Performance Evaluation, 65(11):823--838, 2008.

Digital Library

[5]

V. Berten and B. Gaujal. Brokering strategies in computational grids using stochastic prediction models. Parallel Computing, 33(4--5):238--249, 2007.

Digital Library

[6]

G. Bosilca, A. Bouteiller, A. Danalis, M. Faverge, T. Herault, and J. J. Dongarra. PaRSEC: A programming paradigm exploiting heterogeneity for enhancing scalability. IEEE Computing in Science and Engineering, to appear. available online at http://www.netlib.org/utk/people/JackDongarra/PAPERS/ieee_cise_submitted_2.pdf.

[7]

G. Bosilca, A. Bouteiller, A. Danalis, T. Herault, P. Lemarinier, and J. Dongarra. DAGuE: A generic distributed DAG engine for high performance computing. Parallel Computing, 38(1):37--51, 2012.

Digital Library

[8]

J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51(1):107--113, 2008.

Digital Library

[9]

N. Gast, B. Gaujal, and J.-Y. Le Boudec. Mean field for Markov decision processes: from discrete to continuous optimization. IEEE Transactions on Automatic Control, 57(9):2266--2280, 2012.

[10]

T. Gautier, X. Besseron, and L. Pigeon. Kaapi: A thread scheduling runtime system for data flow computations on cluster of multi-processors. In Proceedings of the 2007 International Workshop on Parallel Symbolic Computation, PASCO '07, pages 15--23, New York, NY, USA, 2007. ACM.

Digital Library

[11]

D. Gross, J. F. Shortle, J. M. Thompson, and C. M. Harris. Fundamentals of Queueing Theory, 4th Edition. John Wiley and Sons", 2008.

Digital Library

[12]

B. Kreaseck, L. Carter, H. Casanova, and J. Ferrante. Autonomous protocols for bandwidth-centric scheduling of independent-task applications. In Proceedings of the 17th International Parallel and Distributed Processing Symposium (IPDPS'03). IEEE, 2003.

Digital Library

[13]

M. Mitzenmacher. Analyses of load stealing models based on differential equations. In Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures, SPAA '98, pages 212--221, New York, NY, USA, 1998. ACM.

Digital Library

[14]

M. Mitzenmacher. How useful is old information? IEEE Trans. Parallel Distrib. Syst., 11(1):6--20, 2000.

Digital Library

[15]

M. Parashar and S. Hariri. Autonomic computing: concepts, infrastructure, and applications. CRC press, 2006.

Digital Library

[16]

J. Planas, R. M. Badia, E. Ayguadé, and J. Labarta. Hierarchical task-based programming with StarSs. International Journal of High Performance Computing Applications, 23(3):284--299, 2009.

Digital Library

[17]

H. Topcuoglu, S. Hariri, and M.-y. Wu. Performance-effective and low-complexity task scheduling for heterogeneous computing. IEEE Transactions on Parallel and Distributed Systems, 13(3):260--274, 2002.

Digital Library

Cited By

Gonthier MMarchal LThibault S(2022)Memory-Aware Scheduling of Tasks Sharing Data on Multiple GPUs with Dynamic Runtime Systems2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)10.1109/IPDPS53621.2022.00073(694-704)Online publication date: May-2022
https://doi.org/10.1109/IPDPS53621.2022.00073
Catalán SIgual FMayo RRodríguez-Sánchez RQuintana-Ortí E(2016)Architecture-aware configuration and scheduling of matrix multiplication on asymmetric multicore processorsCluster Computing10.1007/s10586-016-0611-819:3(1037-1051)Online publication date: 1-Sep-2016
https://dl.acm.org/doi/10.1007/s10586-016-0611-8
Beaumont OEyrauld-Dubois LGuermouche ALambert T(2015)Comparison of Static and Runtime Resource Allocation Strategies for Matrix MultiplicationProceedings of the 2015 27th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)10.1109/SBAC-PAD.2015.10(170-177)Online publication date: 17-Oct-2015
https://dl.acm.org/doi/10.1109/SBAC-PAD.2015.10

Index Terms

Analysis of dynamic scheduling strategies for matrix multiplication on heterogeneous platforms
1. Theory of computation
  1. Design and analysis of algorithms

Recommendations

Performance Tuning of Matrix Multiplication in OpenCL on Different GPUs and CPUs
SCC '12: Proceedings of the 2012 SC Companion: High Performance Computing, Networking Storage and Analysis

OpenCL (Open Computing Language) is a framework for general-purpose parallel programming. Programs written in OpenCL are functionally portable across multiple processors including CPUs, GPUs, and also FPGAs. Using an auto-tuning technique makes ...
HPMaX: heterogeneous parallel matrix multiplication using CPUs and GPUs
Abstract
We present a novel heterogeneous parallel matrix multiplication algorithm that utilizes both central processing units (CPUs) and graphics processing units (GPUs) for large-scale matrices. Based on Strassen’s method, we represent matrix ...
Architecture-aware configuration and scheduling of matrix multiplication on asymmetric multicore processors

Asymmetric multicore processors have recently emerged as an appealing technology for severely energy-constrained environments, especially in mobile appliances where heterogeneity in applications is mainstream. In addition, given the growing interest for ...

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences

HPDC '14: Proceedings of the 23rd international symposium on High-performance parallel and distributed computing

June 2014

334 pages

ISBN:9781450327497

DOI:10.1145/2600212

General Chairs:
Beth Plale
Indiana University, USA
,
Matei Ripeanu
University of British Columbia, CA
,
Program Chairs:
Franck Cappello
Argonne National Lab and INRIA, USA
,
Dongyan Xu
Purdue University, USA

Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 23 June 2014

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article

Conference

HPDC'14

Sponsor:

SIGARCH

HPDC'14: The 23rd International Symposium on High-Performance Parallel and Distributed Computing

June 23 - 27, 2014

BC, Vancouver, Canada

Acceptance Rates

HPDC '14 Paper Acceptance Rate 21 of 130 submissions, 16%;

Overall Acceptance Rate 166 of 966 submissions, 17%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

3
Total Citations
View Citations
228
Total Downloads

Downloads (Last 12 months)1
Downloads (Last 6 weeks)0

Reflects downloads up to 05 Mar 2025

Other Metrics

View Author Metrics

Citations

Cited By

Gonthier MMarchal LThibault S(2022)Memory-Aware Scheduling of Tasks Sharing Data on Multiple GPUs with Dynamic Runtime Systems2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)10.1109/IPDPS53621.2022.00073(694-704)Online publication date: May-2022
https://doi.org/10.1109/IPDPS53621.2022.00073
Catalán SIgual FMayo RRodríguez-Sánchez RQuintana-Ortí E(2016)Architecture-aware configuration and scheduling of matrix multiplication on asymmetric multicore processorsCluster Computing10.1007/s10586-016-0611-819:3(1037-1051)Online publication date: 1-Sep-2016
https://dl.acm.org/doi/10.1007/s10586-016-0611-8
Beaumont OEyrauld-Dubois LGuermouche ALambert T(2015)Comparison of Static and Runtime Resource Allocation Strategies for Matrix MultiplicationProceedings of the 2015 27th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)10.1109/SBAC-PAD.2015.10(170-177)Online publication date: 17-Oct-2015
https://dl.acm.org/doi/10.1109/SBAC-PAD.2015.10

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten