Open Access

RIPL: A Parallel Image Processing Language for FPGAs

Authors:
Robert Stewart (Heriot-Watt University, Edinburgh, UK; ORCID 0000-0003-0365-693X)
Kirsty Duncan (Heriot-Watt University, Edinburgh, UK)
Greg Michaelson (Heriot-Watt University, Edinburgh, UK)
Paulo Garcia (Heriot-Watt University, Edinburgh, UK)
Deepayan Bhowmik (Sheffield Hallam University, Sheffield, UK)
Andrew Wallace (Heriot-Watt University, Edinburgh, UK)

ACM Transactions on Reconfigurable Technology and Systems, Volume 11, Issue 1, Article No. 7, pp. 1–24. https://doi.org/10.1145/3180481

Published: 14 March 2018


Abstract

Specialized FPGA implementations can deliver higher performance and greater power efficiency than embedded CPU or GPU implementations for real-time image processing. Programming challenges limit their wider use, because the implementation of FPGA architectures at the register transfer level is time consuming and error prone. Existing software languages supported by high-level synthesis (HLS), although providing a productivity improvement, are too general purpose to generate efficient hardware without the use of hardware-specific code optimizations. Such optimizations leak hardware details into the abstractions that software languages are there to provide, and they require knowledge of FPGAs to generate efficient hardware, such as by using language pragmas to partition data structures across memory blocks.

This article presents a thorough account of the Rathlin image processing language (RIPL), a high-level image processing domain-specific language for FPGAs. We motivate its design, based on higher-order algorithmic skeletons, with requirements from the image processing domain. RIPL’s skeletons suffice to elegantly describe image processing stencils, as well as recursive algorithms with nonlocal random access patterns. At its core, RIPL employs a dataflow intermediate representation. We give a formal account of the compilation scheme from RIPL skeletons to static and cyclostatic dataflow models to describe their data rates and static scheduling on FPGAs.
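To make the idea of higher-order skeletons concrete, the following is a minimal Python sketch of a pointwise skeleton and a stencil skeleton. It is not RIPL syntax and does not reflect the RIPL compiler: the names `map_pixels`, `stencil`, and `box_blur`, the list-of-lists image type, and the clamp-to-edge border policy are all assumptions made for illustration. It only shows how a user-supplied function is lifted over every pixel or every pixel neighborhood.

```python
from typing import Callable, List

Image = List[List[int]]  # hypothetical image type: rows of integer pixels


def map_pixels(f: Callable[[int], int], img: Image) -> Image:
    """Pointwise skeleton: apply f to every pixel independently."""
    return [[f(p) for p in row] for row in img]


def stencil(f: Callable[[List[int]], int], img: Image, radius: int = 1) -> Image:
    """Stencil skeleton: apply f to the (2r+1) x (2r+1) window around each pixel.

    Out-of-range coordinates are clamped to the nearest edge pixel
    (an assumed border policy, chosen only to keep the sketch short).
    """
    h, w = len(img), len(img[0])

    def window(y: int, x: int) -> List[int]:
        return [
            img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
            for dy in range(-radius, radius + 1)
            for dx in range(-radius, radius + 1)
        ]

    return [[f(window(y, x)) for x in range(w)] for y in range(h)]


def box_blur(win: List[int]) -> int:
    """Integer mean of the window, a simple example stencil function."""
    return sum(win) // len(win)


img = [[0, 0, 0],
       [0, 9, 0],
       [0, 0, 0]]

blurred = stencil(box_blur, img)                               # 3x3 box filter
brightened = map_pixels(lambda p: min(p + 10, 255), blurred)   # pointwise stage
```

Because each skeleton fixes the access pattern and leaves only the per-pixel function open, a compiler can in principle derive data rates and buffering from the skeleton alone, which is the property the dataflow compilation scheme relies on.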

RIPL compares favorably to the Vivado HLS OpenCV library and C++ compiled with Vivado HLS. RIPL achieves between 54 and 191 frames per second (FPS) at 100 MHz for four synthetic benchmarks, faster than HLS OpenCV in three cases. Two real-world algorithms are implemented in RIPL: visual saliency and mean shift segmentation. For the visual saliency algorithm, RIPL achieves 71 FPS compared to 28 FPS for optimized C++. RIPL is also concise, being 5x shorter than C++ and 111x shorter than an equivalent direct dataflow implementation. For mean shift segmentation, RIPL achieves 7 FPS compared to 1.1 FPS for optimized C++ on 64 CPU cores, and RIPL is 10x shorter than the direct dataflow FPGA implementation.
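The reported frame rates can be turned into per-frame cycle budgets and speedup ratios with simple arithmetic. The sketch below uses only figures stated in the abstract; the helper names `cycles_per_frame` and `speedup` are hypothetical. The cycle budget is computed only for the synthetic benchmarks, since 100 MHz is the clock the abstract gives for them.

```python
CLOCK_HZ = 100_000_000  # 100 MHz, the clock stated for the synthetic benchmarks


def cycles_per_frame(fps: float, clock_hz: int = CLOCK_HZ) -> float:
    """Clock cycles available per frame at a given frame rate."""
    return clock_hz / fps


def speedup(fps_a: float, fps_b: float) -> float:
    """Ratio of two frame rates (how many times faster a is than b)."""
    return fps_a / fps_b


slowest_budget = cycles_per_frame(54)    # slowest synthetic benchmark
fastest_budget = cycles_per_frame(191)   # fastest synthetic benchmark
saliency_speedup = speedup(71, 28)       # RIPL vs optimized C++ (visual saliency)
mean_shift_speedup = speedup(7, 1.1)     # RIPL vs C++ on 64 CPU cores (mean shift)

print(f"{slowest_budget:,.0f} cycles/frame at 54 FPS")
print(f"{fastest_budget:,.0f} cycles/frame at 191 FPS")
print(f"visual saliency speedup: {saliency_speedup:.2f}x")
print(f"mean shift speedup: {mean_shift_speedup:.2f}x")
```

Under these figures the synthetic benchmarks use roughly 0.5 to 1.9 million cycles per frame, and the two real-world algorithms run about 2.5x and 6.4x faster than their C++ baselines.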



Published in
ACM Transactions on Reconfigurable Technology and Systems Volume 11, Issue 1
Special Section on FCCM 2016 and Regular Papers
March 2018
183 pages
ISSN:1936-7406
EISSN:1936-7414
DOI:10.1145/3178391
Editor:
Steve Wilton
Department of Electrical and Computer Engineering / University of British Columbia / Kaiser 4112, 5500-2332 Main Mall / Vancouver, BC V6T 1Z4 Canada
Copyright © 2018 Owner/Author
This work is licensed under a Creative Commons Attribution International 4.0 License.
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 14 March 2018
- Accepted: 1 December 2017
- Revised: 1 November 2017
- Received: 1 February 2017

Author Tags
Cyclo static dataflow
Dataflow
Domain specific languages
FPGA
Hardware accelerators
High level synthesis
Image processing
OpenCV
Parallel processing
RIPL
Semantics
Qualifiers
- research-article
- Research
- Refereed