research-article
Open Access

RIPL: A Parallel Image Processing Language for FPGAs

Published: 14 March 2018

Abstract

Specialized FPGA implementations can deliver higher performance and greater power efficiency than embedded CPU or GPU implementations for real-time image processing. Programming challenges limit their wider use, because the implementation of FPGA architectures at the register transfer level is time consuming and error prone. Existing software languages supported by high-level synthesis (HLS), although providing a productivity improvement, are too general purpose to generate efficient hardware without the use of hardware-specific code optimizations. Such optimizations leak hardware details into the abstractions that software languages are there to provide, and they require knowledge of FPGAs to generate efficient hardware, such as by using language pragmas to partition data structures across memory blocks.

This article presents a thorough account of the Rathlin image processing language (RIPL), a high-level image processing domain-specific language for FPGAs. We motivate its design, based on higher-order algorithmic skeletons, with requirements from the image processing domain. RIPL’s skeletons suffice to elegantly describe image processing stencils, as well as recursive algorithms with nonlocal random access patterns. At its core, RIPL employs a dataflow intermediate representation. We give a formal account of the compilation scheme from RIPL skeletons to static and cyclostatic dataflow models to describe their data rates and static scheduling on FPGAs.
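To illustrate what "data rates and static scheduling" mean for a static dataflow model, the following minimal sketch solves the standard synchronous dataflow balance equations for a small actor graph. It is not RIPL's compiler or its intermediate representation; the function name `repetition_vector` and the actor names `source`, `halve`, and `sink` are hypothetical and chosen only for this example.

```python
from fractions import Fraction
from functools import reduce
from math import gcd, lcm


def repetition_vector(actors, edges):
    """Solve the balance equations q[src] * produce == q[dst] * consume.

    actors: list of actor names (the graph is assumed to be connected).
    edges:  list of (src, dst, produce, consume) token rates per firing.
    Returns the smallest positive integer firing counts per actor.
    Raises ValueError if the rates are inconsistent, in which case no
    bounded-memory static schedule exists.
    """
    rate = {actors[0]: Fraction(1)}  # fix one actor, derive the others
    pending = list(edges)
    while pending:
        remaining = []
        progressed = False
        for src, dst, produce, consume in pending:
            if src in rate and dst in rate:
                # Both ends known: the edge must agree with earlier ones.
                if rate[src] * produce != rate[dst] * consume:
                    raise ValueError("inconsistent token rates")
                progressed = True
            elif src in rate:
                rate[dst] = rate[src] * produce / consume
                progressed = True
            elif dst in rate:
                rate[src] = rate[dst] * consume / produce
                progressed = True
            else:
                remaining.append((src, dst, produce, consume))
        if not progressed:
            raise ValueError("graph is not connected")
        pending = remaining
    if any(a not in rate for a in actors):
        raise ValueError("graph is not connected")
    # Scale the rational solution to the smallest integer solution.
    scale = lcm(*(r.denominator for r in rate.values()))
    counts = {a: int(rate[a] * scale) for a in actors}
    common = reduce(gcd, counts.values())
    return {a: n // common for a, n in counts.items()}


# Hypothetical pipeline: `halve` consumes two pixels and emits one.
pipeline = repetition_vector(
    ["source", "halve", "sink"],
    [("source", "halve", 1, 2), ("halve", "sink", 1, 1)],
)
print(pipeline)
```

In this example `source` must fire twice for every firing of `halve` and `sink`, so the token counts on each edge return to their starting values after one iteration of the schedule. Cyclostatic dataflow generalizes this by letting each actor's rates cycle through a fixed sequence of values.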

RIPL compares favorably to the Vivado HLS OpenCV library and C++ compiled with Vivado HLS. RIPL achieves between 54 and 191 frames per second (FPS) at 100 MHz for four synthetic benchmarks, faster than HLS OpenCV in three cases. Two real-world algorithms are implemented in RIPL: visual saliency and mean shift segmentation. For the visual saliency algorithm, RIPL achieves 71 FPS compared to optimized C++ at 28 FPS. RIPL is also concise, being 5x shorter than C++ and 111x shorter than an equivalent direct dataflow implementation. For mean shift segmentation, RIPL achieves 7 FPS compared to optimized C++ on 64 CPU cores at 1.1 FPS, and RIPL is 10x shorter than the direct dataflow FPGA implementation.
