Abstract
Domain-specific optimizations on matrix computations exploiting specific arithmetic and matrix representation formats have achieved significant performance/area gains in Field-Programmable Gate Array (FPGA) hardware designs. In this article, we explore the application of data-driven optimizations to reduce both storage and computation requirements to the problem of signal recognition from a known dictionary. By starting with a high-level mathematical representation of a signal recognition problem, we perform optimizations across the layers of the system, exploiting mathematical structure to improve implementation efficiency. Specifically, we use Walsh wavelet packets in conjunction with a BestBasis algorithm to distinguish between spoken digits. The resulting transform matrices are quite sparse, and exhibit a rich algebraic structure that contains significant overlap across rows. As a consequence, dot-product computations of the transform matrix and signal vectors exhibit significant computation reuse, or repeated identical computations. We present an algorithm for identifying this computation reuse and scheduling of the row computations. We exploit this reuse to derive FPGA hardware implementations that reduce the amount of computation for an individual matrix by as much as 6.35× and an average of 2× for a single dot-product unit. The implementation that exploits reuse achieves a 2× computation reduction compared to three concurrently-executing simpler accumulator units with the same aggregate design area and outperforms software implementations on high-end desktop personal computers.
- Aho, A., Lam, M., Sethi, R., and Ullman, J. 2006. Compilers: Principles, Techniques and Toos 2nd Ed. Addison-Wesley, Reading, MA. Google ScholarDigital Library
- Bai, Z., Demmel, J., Dongarra, J., Ruhe, A., and van der Vorst, H., Eds. 2000. Templates for the Soution of Algebraic Eigenvalue Problems: A Practical Guide. Society for Industrial and Applied Mathematics, Philadelphia, PA. Google ScholarDigital Library
- Beauchamp, M., Hauck, S., Underwood, K., and Hemmert, S. 2006. Embedded floating-point units in FPGAs. In Proceedings of the ACM/SIGDA 14th International Symposium on Field Programmable Gate Arrays (FPGA’06). ACM Press, New York, 12--20. Google ScholarDigital Library
- Böhm, W., Draper, B., Najjar, W., Hammes, J., Rinker, R., Chawathe, M., and Ross, C. 2001. One-Step compilation of image processing applications to FPGAs. In Proceedings of the the 9th IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM’01). IEEE Computer Society Press, Los Alamitos, CA, 209--218. Google ScholarDigital Library
- Bouganis, C.-S., Park, S.-B., Constantinides, G., and Cheung, P. 2009. Synthesis and optimization of 2D filter designs for heterogeneous FPGAs. ACM Trans. Reconfig. Technol. Syst. 1, 4, 1--28. Google ScholarDigital Library
- Callanan, O., Gregg, D., Nisbet, A., and Peardon, M. 2006. High performance scientific computing using FPGAs with IEEE floating point and logarithmic arithmetic for lattice QCD. In Proceedings of the International Conference on Field Programmable Logic and Applications (FPL’06). 1--6.Google Scholar
- Cohen, A., DeVore, R. A., and Hochmuth, R. 2000. Restricted nonlinear approximation. Construct. Approx. 16, 1, 85--113.Google ScholarCross Ref
- Coifman, R. and Wickerhauser, M. 1992. Entropy-Based algorithms for best basis selection. IEEE Trans. Inf. Theory 38, 2, 713--718.Google ScholarDigital Library
- Constantinides, G., Cheung, P., and Luk, W. 2003. Synthesis of saturation arithmetic architectures. ACM Trans. Des. Autom. Electron. Syst. 8, 3, 334--354. Google ScholarDigital Library
- d’Alberto, P., Milder, P., Sandryhaila, A., Franchetti, F., Hoe, J., Moura, J., Püschel, M., and Johnson, J. 2007. Generating FPGA accelerated DFT libraries. In Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM’07). IEEE Computer Society Press, Los Alamitos, CA, 173--184. Google ScholarDigital Library
- deLorimier, M. and DeHon, A. 2005. Floating-Point sparse matrix-vector multiply for FPGAs. In Proceedings of the ACM/SIGDA 13th International Symposium on Field-Programmable Gate Arrays (FPGA’05). ACM Press, New York. Google ScholarDigital Library
- DeVore, R. A., Jawerth, B., and Popov, V. 1992. Compression of wavelet decompositions. Amer. J. Math. 114, 737--785.Google ScholarCross Ref
- Donoho, D. and Elad, M. 2003. Optimal sparse representations in general (nonorthogonal) dictionaries via ℓ<sup>1</sup> minimization. Proc. Nat. Acad. Sci. 100, 2197--2202.Google ScholarCross Ref
- Dou, Y., Vassiliadis, S., Kuzmanov, G., and Gaydadjiev, G. 2005. 64-bit floating-point FPGA matrix multiplication. In Proceedings of the ACM/SIGDA 13th International Symposium on Field-Programmable Gate Arrays (FPGA’05). ACM Press, New York, 86--95. Google ScholarDigital Library
- Ellis, D. 2003. Recoded digits archive at Columbia University.Google Scholar
- Fahmy, S., Cheung, P., and Luk, W. 2005. Novel FPGA-based implementation of median and weighted median filters for image processing. In Proceedings of the International Conference on Field Programmable Logic and Applications (FPL’05). 142--147.Google Scholar
- Frigo, M. 1999. A fast fourier transform compiler. In Proceedings of the ACM Conference on Programming Language Design and Implementation (PLDI’99). ACM Press, New York. Google ScholarDigital Library
- Mallat, S. and Zhang, Z. 1993. Matching pursuits with time-frequency dictionaries. IEEE Trans. Signal Process. 41, 12, 3397--3415.Google ScholarDigital Library
- Marcus, G. and Nolazco-FIores, J. A. 2005. An FPGA-based coprocessor for the SPHINX speech recognition system: Early experiences. In Proceedings of the International Conference on Reconfigurable Computing and FPGAs (ReConFig’05). IEEE Computer Society, 27. Google ScholarDigital Library
- Mathew, B., Davis, A., and Fang, Z. 2003. A low-power accelerator for the SPHINX 3 speech recognition system. In Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES’03). Google ScholarDigital Library
- Melnikoff, S., Quigley, S., and Russell, M. 2002. Implementing a simple continuous speech recognition system on an FPGA. In Proceedings of the 10th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM’02). Google ScholarDigital Library
- Milder, P., Franchetti, F., Hoe, J., and Pschel, M. 2008. Formal datapath representation and manipulation for implementing DSP transforms. In Proceedings Design Automation Conference (DAC). 385--390. Google ScholarDigital Library
- Ortigosa, E., Ortigosa, P., Canas, A., Ros, E., Agis, R., and Ortega, J. 2003. FPGA implementation of multi-layer perceptrons for speech recognition. In Proceedings of the 13th International Conference on Field Programmable Logic and Applications (FPL’03). Lecture Notes in Computer Science, vol. 2778. Springer.Google Scholar
- Park, I.-C. and Kang, H.-J. 2001. Digital filter synthesis based on minimal signed digit representation. In Proceedings of the 38th Conference on Design Automation (DAC’01). ACM Press, New York, 468--473. Google ScholarDigital Library
- Pati, Y. C., Rezaiifar, R., and Krishnaprasad, P. S. 1993. Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. In Proceedings of the 27th Annual Asilomar Conference on Signals, Systems, and Computers. 40--44.Google Scholar
- Proakis, J. G. and Manolakis, D. K. 1996. Digital Signal Processing: Principles, Algorithms and Applications 3rd Ed. Prentice-Hall, NJ. Google ScholarDigital Library
- Püschel, M., Moura, J., Johnson, J., Padua, D., Veloso, M., Singer, B., Xiong, J., Franchetti, F., Gacic, A., Voronenko, Y., Chen, K., Johnson, R., and Rizzolo, N. 2005. SPIRAL: Code generation for DSP transforms. In Proceedings of the IEEE (Special Issue on Program Generation, Optimization, and Adaptation) 93, 2, 232--275.Google ScholarCross Ref
- Saito, N. and Coifman, R. 1994. Local discriminant bases. In SPIE Math. Imag. Wavelet Appl. Signal Image Process. (SPIE). vol. 2303.Google Scholar
- Temam, O. and Jalby, W. 1992. Characterizing the behavior of sparse algorithms on caches. In Proceedings of Supercomputing (SC’92). IEEE Computer Society Press, 578--587. Google ScholarDigital Library
- Thiele, C. M. and Villemoes, L. F. 1996. A fast algorithm for adapted time frequency tilings. Appl. Comput. Harmon. Anal. 3, 91--99.Google ScholarCross Ref
- Tropp, J. A. 2003. Greed is good: Algorithmic results for sparse approximation. Tech. rep. 2003--04, Texas Institute for Computational and Applied Mathematics.Google Scholar
- Underwood, K. 2004. FPGAs vs. CPUs: Trends in peak floating-point performance. In Proceedings of the 12th ACM International Symposium on Field Programmable Gate Arrays (FPGA’04). ACM Press, New York, 171--180. Google ScholarDigital Library
- Vuduc, R. and Moon, H.-J. 2005. Fast sparse matrix-vector multiplication by exploiting variable block structure. In Proceedings of the International Conference on High Performance Computing and Communcations (HPCC). Lecture Notes in Computer Science, vol. 3726. Springer, 807--816. Google ScholarDigital Library
- Woods, R., McAllister, J., and Lightbody, G. 2008. FPGA-Based Implementation of Signal Processing Systems. John Wiley & Sons. Google ScholarDigital Library
- Xilinx Corp. 2007. Virtex-II Pro<sup>TM</sup> platform FPGAs: Complete data sheet.Google Scholar
- Xilinx Corp. 2008. Virtex-3<sup>TM</sup> platform FPGAs: Complete data sheet.Google Scholar
- Xilinx Corp. 2009. Virtex-5<sup>TM</sup> platform FPGAs: Complete data sheet.Google Scholar
- Zhuo, L. and Prasanna, V. K. 2005. Sparse matrix-vector multiplication on FPGAs. In Proceedings of the ACM/SIGDA 13th International Symposium on Field-Programmable Gate Arrays (FPGA’05). ACM Press, New York, 63--74. Google ScholarDigital Library
Index Terms
- Domain-Specific Optimization of Signal Recognition Targeting FPGAs
Recommendations
Computation reuse in domain-specific optimization of signal recognition
FPGA '09: Proceedings of the ACM/SIGDA international symposium on Field programmable gate arraysDomain-specific optimizations that exploit specific arithmetic and representation formats have been shown to achieve significant performance/area gains in FPGA hardware designs. In this work, we describe an approach to domain-specific optimization that ...
In-Package Domain-Specific ASICs for Intel® Stratix® 10 FPGAs: A Case Study of Accelerating Deep Learning Using TensorTile ASIC(Abstract Only)
FPGA '18: Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate ArraysFPGAs or ASICs? There is a long-running debate on this. FPGAs are extremely flexible while ASICs offer top efficiency but inflexible. We believe that FPGAs and ASICs are better together, to offer both flexible and efficient solutions. We propose single-...
Domain-Specific Language for HW/SW Co-design for FPGAs
DSL '09: Proceedings of the IFIP TC 2 Working Conference on Domain-Specific LanguagesThis article describes FSMLanguage, a domain-specific language for HW/SW co-design targeting platform FPGAs. Modern platform FPGAs provide a wealth of configurable logic in addition to embedded processors, distributed RAM blocks, and DSP slices in order ...
Comments