ABSTRACT
Many applications, such as medical imaging, generate intensive data traffic between the FPGA and off-chip memory. Execution time can be improved significantly by using on-chip (scratchpad) memories effectively, combined with careful software-based data reuse and communication scheduling techniques. We present a fully automated C-to-FPGA framework to address this problem. Our framework implements data reuse through aggressive loop transformation-based program restructuring. In addition, it automatically implements performance-critical optimizations such as task-level parallelization, loop pipelining, and data prefetching.
We leverage the power and expressiveness of the polyhedral compilation model to develop a multi-objective optimization system for off-chip communication management. Our technique can satisfy hardware resource constraints (scratchpad size) while still aggressively exploiting data reuse. Our approach can also be used to reduce the on-chip buffer size subject to a bandwidth constraint. We also implement a fast design space exploration technique for effective optimization of program performance using the Xilinx high-level synthesis tool.
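To make the idea of scratchpad-based data reuse concrete, the following C sketch contrasts a loop that reads every operand from off-chip memory with a tiled version that first copies each tile into a small local buffer. It is an illustration only, not code produced by the framework: the 3-point stencil, the function names, and the `MAX_TILE` capacity are all assumptions chosen for the example, and the returned read counters stand in for off-chip traffic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scratchpad capacity, in output elements per tile. */
#define MAX_TILE 64

/* Baseline: each output reads its three inputs directly from "off-chip"
   memory. Returns the number of off-chip reads performed. */
size_t stencil_naive(const int *in, int *out, size_t n)
{
    size_t reads = 0;
    for (size_t i = 1; i + 1 < n; ++i) {
        out[i] = in[i - 1] + in[i] + in[i + 1];
        reads += 3;
    }
    return reads;
}

/* Tiled: copy one tile plus a one-element halo on each side into a local
   buffer (standing in for the on-chip scratchpad), then compute only from
   that buffer. Returns the number of off-chip reads performed. */
size_t stencil_tiled(const int *in, int *out, size_t n, size_t tile)
{
    int buf[MAX_TILE + 2];
    size_t reads = 0;

    assert(tile >= 1 && tile <= MAX_TILE);

    for (size_t start = 1; start + 1 < n; start += tile) {
        size_t end = start + tile;      /* exclusive upper bound of the tile */
        if (end > n - 1)
            end = n - 1;
        size_t len = end - start;

        /* Burst-load in[start-1 .. end] into the local buffer. */
        for (size_t k = 0; k < len + 2; ++k) {
            buf[k] = in[start - 1 + k];
            ++reads;
        }

        /* All three operands of every output now come from the buffer. */
        for (size_t k = 0; k < len; ++k)
            out[start + k] = buf[k] + buf[k + 1] + buf[k + 2];
    }
    return reads;
}
```

In this sketch a full tile of `tile` outputs costs `tile + 2` off-chip reads instead of `3 * tile`, so a larger tile lowers traffic but needs a larger buffer. That is the kind of trade-off between scratchpad size and bandwidth that the abstract describes the optimization system as managing.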
Index Terms
- Polyhedral-based data reuse optimization for configurable computing