Article

Automatic mapping of nested loops to FPGAS

Authors:

Uday Bondhugula,

P. SadayappanAuthors Info & Claims

PPoPP '07: Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming

Pages 101 - 111

https://doi.org/10.1145/1229428.1229446

Published: 14 March 2007 Publication History

Abstract

This paper present a framework for automatic mapping of perfectly nested loops with constant dependences onto regular processor arrays, suitable for direct implementation on Field Programmable Gate Arrays (FPGAs). The problem is modeled as that of finding a suitable completion procedure for a full-rank linear transformation on the iteration space. The approach enables extraction of necessary degrees of communication-free and pipelined parallelism to optimize performance under the resource constraints of limited logic resources and I/O bandwidth available on an FPGA. The generation of control signals for the custom processing elements is also addressed. Examples of automatic derivation of parallel designs for some common nested loops are provided. Experimental results on the Cray XD1 show that an FPGA-based matrix-multiplication design obtained using the framework attains significant speedup on the XD1's attached FPGA, when compared to execution on the XD1 CPU.

References

[1]

PolyLib - A library of polyhedral functions. http://icps.u-strasbg.fr/polylib/.

[2]

C. Ancourt and F. Irigoin. Scanning polyhedra with do loops. In PPoPP'91, pages 39--50, 1991.

Digital Library

[3]

R. Bagnara, P. Hill, and E. Zaffanella. PPL: The Parma Polyhedra Library. http://www.cs.unipr.it/ppl/.

[4]

M. Bednara and J. Teich. Automatic synthesis of FPGA processor arrays from loop algorithms. Journal of Supercomputing, 26(2):149--165, Sept. 2003.

Digital Library

[5]

U. Bondhugula, A. Devulapalli, J. Dinan, J. Fernando, P. Wyckoff, E. Stahlberg, and P. Sadayappan. Hardware/software integration for FPGA-based all pairs shortest-paths. In IEEE FCCM'06, Apr. 2006.

Digital Library

[6]

U. Bondhugula, A. Devulapalli, J. Fernando, P. Wyckoff, and P. Sadayappan. Parallel FPGA-based all-pairs shortest-paths in a directed graph. In IPDPS'06, Apr. 2006.

Digital Library

[7]

ClearSpeed Inc. ClearSpeed CSX 600, 2004. http://www.clearspeed.com/.

[8]

Cray Inc. Cray XD1 whitepaper, 2005. http://www.cray.com/products/xd1/.

[9]

A. Darte, S. Derrien, and T. Risset. Hardware/Software Interface for Multi-Dimensional Processor Arrays. In IEEE ASAP, pages 28--35, 2005.

Digital Library

[10]

A. Darte, R. Schreiber, B. R. Rau, and F. Vivien. A Constructive Solution to the Juggling Problem in Processor Array Synthesis. In IPDPS'00, pages 815--822, 2000.

Digital Library

[11]

Y. Dou, S. Vassiliadis, G. K. Kuzmanov, and G. N. Gaydadjiev. 64-bit floating-point FPGA matrix multiplication. In ACM FPGA'05, pages 86--95, 2005.

Digital Library

[12]

P. Feautrier. Some efficient solutions to the affine scheduling problem. part II. multidimensional time. International Journal of Parallel Programming, 21(6):389--420, 1992.

Digital Library

[13]

M. B. Gokhale, J. M. Stone, J. Arnold, and M. Kalinowski. Stream-Oriented FPGA computing in the Streams-C high level language. In FCCM '00, pages 49--56, 2000.

Digital Library

[14]

A.-C. Guillou, P. Quinton, and T. Risset. Hardware synthesis for multi-dimensional time. IEEE ASAP'03, 00:40, 2003.

[15]

Z. Guo, B. Buyukkurt, and W. Najjar. Input data reuse in compiling window operations onto reconfigurable hardware. In LCTES '04, pages 249--256, 2004.

Digital Library

[16]

IRISA COSI. The MMAlpha environment. http://www.irisa.fr/cosi/ALPHA/mmalpha_english.html.

[17]

B. Kienhuis, E. Rijpkema, and E. Deprettere. Compaan: Deriving process networks from Matlab for embedded signal processing architectures. In 8th International workshop on Hardware/software codesign, pages 13--17, 2000.

Digital Library

[18]

W. Li and K. Pingali. A singular loop transformation framework based on non-singular matrices. International Journal of Parallel Programming, 2 (2):183--205, 1994.

Digital Library

[19]

A. W. Lim, G. I. Cheong, and M. S. Lam. An affine partitioning algorithm to maximize parallelism and minimize communication. In ICS'99, pages 228--237, 1999.

Digital Library

[20]

A. W. Lim and M. S. Lam. Maximizing parallelism and minimizing synchronization with affine partitions. Parallel Computing, 24(3-4):445--475, 1998.

Digital Library

[21]

D. I. Moldovan and J. A. B. Fortes. Partitioning and mapping algorithms into fixed size systolic arrays. IEEE Transactions on Computers, 35(1):1--12, 1986.

Digital Library

[22]

P. Quinton. Automatic synthesis of systolic Arrays from uniform recurrent equations. In ISCA, pages 208--214, 1984.

Digital Library

[23]

R. Schreiber, S. Aditya, S. Mahlke, V. Kathail, B. R. Rau, D. Cronquist, and M. Sivaraman. PICO-NPA: High-Level synthesis of non-programmable hardware accelerators. J. VLSI Signal Process. Syst., 31(2):127--142, 2002.

Digital Library

[24]

A. Schrijver. Theory of Linear and Integer Programming. Wiley, 1987. SchRI a 87:1 1.Ex.

Digital Library

[25]

B. So, M. W. Hall, and P. C. Diniz. A compiler approach to fast hardware design space exploration in FPGA-based systems. In PLDI'02, pages 165--176, 2002.

Digital Library

[26]

J. Teich, L. Thiele, and L. Zhang. Partitioning processor arrays under resource constraints. Int. Journal on VLSI and Signal Processing Systems, 17. (1):5--20, 1997.

Digital Library

[27]

K. Underwood. FPGAs vs. CPUs: Trends in peak floating-point performance. In ACM FPGA'04, 2004.

Digital Library

[28]

K. D. Underwood and K. S. Hemmert. Closing the Gap: CPU and FPGA Trends in Sustainable Floating-Point BLAS Performance. In IEEE FCCM'04, Apr. 2004.

Digital Library

[29]

Xilinx Inc. Virtex-II Pro and Virtex-II Pro X Platform FPGAs: complete data sheet. Xilinx Inc., 2005.

[30]

K. Yotov, X. Li, G. Ren, M. Cibulskis, G. DeJong, M. Garzaran, D. A. Padua, K. Pingali, P. Stodghill, and P. Wu. A comparison of empirical and model driven optimization. In PLDI'03, pages 63--76, 2003.

Digital Library

[31]

L. Zhuo and V. K. Prasanna. High performance linear algebra on a reconfigurable supercomputer. In Supercomputing, Nov. 2005.

Digital Library

Cited By

Iskandar VAbd El Ghany MGoehringer D(2023)Auto-DOK: Compiler-Assisted Automatic Detection of Offload Kernels for FPGA-HBM Architectures2023 26th Euromicro Conference on Digital System Design (DSD)10.1109/DSD60849.2023.00085(577-584)Online publication date: 6-Sep-2023
https://doi.org/10.1109/DSD60849.2023.00085
Xiang SLai YZhou YChen HZhang NPal DZhang ZAdler MIenne P(2022)HeteroFlowProceedings of the 2022 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays10.1145/3490422.3502369(78-88)Online publication date: 13-Feb-2022
https://dl.acm.org/doi/10.1145/3490422.3502369
Wijerathne DLi ZPathania AMitra TThiele L(2022)HiMap: Fast and Scalable High-Quality Mapping on CGRA via Hierarchical AbstractionIEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems10.1109/TCAD.2021.313255141:10(3290-3303)Online publication date: Oct-2022
https://doi.org/10.1109/TCAD.2021.3132551
Show More Cited By

Index Terms

Automatic mapping of nested loops to FPGAS
1. Computer systems organization
  1. Architectures
    1. Other architectures
      1. Reconfigurable computing
      2. Self-organizing autonomic computing
2. Software and its engineering
  1. Software notations and tools
    1. Compilers

Recommendations

Parallelizing tightly nested loops
IPPS '91: Proceedings of the Fifth International Parallel Processing Symposium

Presents a new technique to parallelize nested loops at the statement level. It transforms sequential nested loops, either vectorizable or not, into parallel ones. Previously, the wavefront method was used to parallelize non-vectorizable nested loops. ...
Automatic Synthesis of FPGA Processor Arrays from Loop Algorithms

We consider the problem of automatic mapping of computation-intensive loop nests onto FPGA hardware. The regular cell array structure of these chips reflects the parallelism in regular loop-like computations. Furthermore, the flexibility of FPGAs allows ...
A method for estimating optimal unrolling times for nested loops
ISPAN '97: Proceedings of the 1997 International Symposium on Parallel Architectures, Algorithms and Networks

Loop unrolling is one of the most promising parallelization techniques, because the nature of programs causes most of the processing time to be spent in their loops. Unrolling not only the innermost loop but also outer loops greatly expands the scope ...

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences

PPoPP '07: Proceedings of the 12th ACM SIGPLAN symposium on Principles and practice of parallel programming

March 2007

284 pages

ISBN:9781595936028

DOI:10.1145/1229428

General Chair:
Katherine Yelick
UC Berkeley and Lawrence Berkeley National Lab., USA
,
Program Chair:
John Mellor-Crummey
Rice University, USA

Copyright © 2007 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 14 March 2007

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Article

Conference

PPoPP07

Sponsor:

PPoPP07: ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

March 14 - 17, 2007

California, San Jose, USA

Acceptance Rates

PPoPP '07 Paper Acceptance Rate 22 of 65 submissions, 34%;

Overall Acceptance Rate 230 of 1,014 submissions, 23%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

13
Total Citations
View Citations
681
Total Downloads

Downloads (Last 12 months)13
Downloads (Last 6 weeks)2

Reflects downloads up to 07 Mar 2025

Other Metrics

View Author Metrics

Citations

Cited By

Iskandar VAbd El Ghany MGoehringer D(2023)Auto-DOK: Compiler-Assisted Automatic Detection of Offload Kernels for FPGA-HBM Architectures2023 26th Euromicro Conference on Digital System Design (DSD)10.1109/DSD60849.2023.00085(577-584)Online publication date: 6-Sep-2023
https://doi.org/10.1109/DSD60849.2023.00085
Xiang SLai YZhou YChen HZhang NPal DZhang ZAdler MIenne P(2022)HeteroFlowProceedings of the 2022 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays10.1145/3490422.3502369(78-88)Online publication date: 13-Feb-2022
https://dl.acm.org/doi/10.1145/3490422.3502369
Wijerathne DLi ZPathania AMitra TThiele L(2022)HiMap: Fast and Scalable High-Quality Mapping on CGRA via Hierarchical AbstractionIEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems10.1109/TCAD.2021.313255141:10(3290-3303)Online publication date: Oct-2022
https://doi.org/10.1109/TCAD.2021.3132551
Lai YUstun EXiang SFang ZRong HZhang Z(2021)Programming and Synthesis for Software-defined FPGA Acceleration: Status and Future ProspectsACM Transactions on Reconfigurable Technology and Systems10.1145/346966014:4(1-39)Online publication date: 13-Sep-2021
https://dl.acm.org/doi/10.1145/3469660
Wang JGuo LCong JShannon LAdler M(2021)AutoSAThe 2021 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays10.1145/3431920.3439292(93-104)Online publication date: 17-Feb-2021
https://dl.acm.org/doi/10.1145/3431920.3439292
Liu DLiu TMo XShang JYin S(2021)Polyhedral-based Pipelining of Imperfectly-Nested Loop for CGRAs2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD)10.1109/ICCAD51958.2021.9643542(1-9)Online publication date: 1-Nov-2021
https://doi.org/10.1109/ICCAD51958.2021.9643542
Lai YRong HZheng SZhang WCui XJia YWang JSullivan BZhang ZLiang YZhang YCong JGeorge NAlvarez JHughes CDubey PXie Y(2020)SuSyProceedings of the 39th International Conference on Computer-Aided Design10.1145/3400302.3415644(1-9)Online publication date: 2-Nov-2020
https://dl.acm.org/doi/10.1145/3400302.3415644
Chugh NVasista VPurini SBondhugula UZaks AMendelson BRauchwerger LHwu W(2016)A DSL Compiler for Accelerating Image Processing Pipelines on FPGAsProceedings of the 2016 International Conference on Parallel Architectures and Compilation10.1145/2967938.2967969(327-338)Online publication date: 11-Sep-2016
https://dl.acm.org/doi/10.1145/2967938.2967969
Pérez-Andrade RTorres-Huitzil CCumplido R(2015)Processor arrays generation for matrix algorithms used in embedded platforms implemented on FPGAsMicroprocessors & Microsystems10.1016/j.micpro.2014.12.00339:7(576-588)Online publication date: 1-Oct-2015
https://dl.acm.org/doi/10.1016/j.micpro.2014.12.003
(2014)Symbolic inner loop parallelisation for massively parallel processor arraysProceedings of the Twelfth ACM/IEEE Conference on Formal Methods and Models for Codesign10.1109/MEMCOD.2014.6961865(219-228)Online publication date: 1-Oct-2014
https://dl.acm.org/doi/10.1109/MEMCOD.2014.6961865
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten