ABSTRACT
We describe a new processing architecture, known as a warp processor, that uses a field-programmable gate array (FPGA) to improve the speed and energy consumption of a software binary executing on a microprocessor. Unlike previous approaches that also improve software using an FPGA but rely on a special compiler, a warp processor achieves these improvements completely transparently and operates from a standard binary. A warp processor dynamically detects the binary's critical regions, reimplements those regions as a custom hardware circuit in the FPGA, and replaces the software region by a call to the new hardware implementation of that region. While not all benchmarks can be improved using warp processing, many can, and the improvements are dramatically better than those achievable by more traditional architecture improvements. The hardest part of warp processing is dynamically reimplementing code regions on an FPGA, which requires partitioning, decompilation, synthesis, placement, and routing tools, all of which must execute with minimal computation time and data memory so as to coexist on chip with the main processor. We describe the results of developing our warp processor. We developed a custom FPGA fabric specifically designed to enable lean place and route tools, and we developed extremely fast and efficient versions of partitioning, decompilation, synthesis, technology mapping, placement, and routing. Warp processors achieve overall application speedups of 6.3X with energy savings of 66% across a set of embedded benchmark applications. We further show that our tools use acceptably small amounts of computation and memory, far less than traditional tools require. Our work illustrates the feasibility and potential of warp processing, and we can foresee the possibility of warp processing becoming a feature in a variety of computing domains, including desktop, server, and embedded applications.
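The first step named above, dynamically detecting a binary's critical regions, can be illustrated with a small sketch. The abstract does not say how detection is done, so the heuristic here is an assumption: treat each taken backward branch as a loop back-edge and rank loops by how often that edge is taken. The function name `find_critical_regions`, the trace format, and the addresses are all hypothetical.

```python
from collections import Counter


def find_critical_regions(branch_trace, top_n=1):
    """Return the top_n most frequently taken backward branches.

    branch_trace: iterable of (branch_pc, target_pc) pairs for taken branches.
    A backward branch (target_pc <= branch_pc) is assumed to be a loop
    back-edge, and the loop body is the address range [target_pc, branch_pc].
    """
    counts = Counter(
        (target, pc) for pc, target in branch_trace if target <= pc
    )
    return [
        {"start": start, "end": end, "count": n}
        for (start, end), n in counts.most_common(top_n)
    ]


# Hypothetical trace: an inner loop whose back-edge is taken 50 times,
# an outer loop whose back-edge is taken 5 times, and one forward branch.
trace = [(0x120, 0x100)] * 50 + [(0x140, 0x0F0)] * 5 + [(0x090, 0x0A0)]
regions = find_critical_regions(trace)
```

In this sketch the forward branch is ignored and the inner loop is ranked first, which makes it the candidate region for reimplementation in hardware. A real on-chip detector would need to work within much tighter memory limits than a full trace allows.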