skip to main content
research-article

Domain-Specific Optimization of Signal Recognition Targeting FPGAs

Published:01 May 2011Publication History
Skip Abstract Section

Abstract

Domain-specific optimizations on matrix computations exploiting specific arithmetic and matrix representation formats have achieved significant performance/area gains in Field-Programmable Gate Array (FPGA) hardware designs. In this article, we explore the application of data-driven optimizations to reduce both storage and computation requirements to the problem of signal recognition from a known dictionary. By starting with a high-level mathematical representation of a signal recognition problem, we perform optimizations across the layers of the system, exploiting mathematical structure to improve implementation efficiency. Specifically, we use Walsh wavelet packets in conjunction with a BestBasis algorithm to distinguish between spoken digits. The resulting transform matrices are quite sparse, and exhibit a rich algebraic structure that contains significant overlap across rows. As a consequence, dot-product computations of the transform matrix and signal vectors exhibit significant computation reuse, or repeated identical computations. We present an algorithm for identifying this computation reuse and scheduling of the row computations. We exploit this reuse to derive FPGA hardware implementations that reduce the amount of computation for an individual matrix by as much as 6.35× and an average of 2× for a single dot-product unit. The implementation that exploits reuse achieves a 2× computation reduction compared to three concurrently-executing simpler accumulator units with the same aggregate design area and outperforms software implementations on high-end desktop personal computers.

References

  1. Aho, A., Lam, M., Sethi, R., and Ullman, J. 2006. Compilers: Principles, Techniques and Toos 2nd Ed. Addison-Wesley, Reading, MA. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Bai, Z., Demmel, J., Dongarra, J., Ruhe, A., and van der Vorst, H., Eds. 2000. Templates for the Soution of Algebraic Eigenvalue Problems: A Practical Guide. Society for Industrial and Applied Mathematics, Philadelphia, PA. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Beauchamp, M., Hauck, S., Underwood, K., and Hemmert, S. 2006. Embedded floating-point units in FPGAs. In Proceedings of the ACM/SIGDA 14th International Symposium on Field Programmable Gate Arrays (FPGA’06). ACM Press, New York, 12--20. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Böhm, W., Draper, B., Najjar, W., Hammes, J., Rinker, R., Chawathe, M., and Ross, C. 2001. One-Step compilation of image processing applications to FPGAs. In Proceedings of the the 9th IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM’01). IEEE Computer Society Press, Los Alamitos, CA, 209--218. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Bouganis, C.-S., Park, S.-B., Constantinides, G., and Cheung, P. 2009. Synthesis and optimization of 2D filter designs for heterogeneous FPGAs. ACM Trans. Reconfig. Technol. Syst. 1, 4, 1--28. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Callanan, O., Gregg, D., Nisbet, A., and Peardon, M. 2006. High performance scientific computing using FPGAs with IEEE floating point and logarithmic arithmetic for lattice QCD. In Proceedings of the International Conference on Field Programmable Logic and Applications (FPL’06). 1--6.Google ScholarGoogle Scholar
  7. Cohen, A., DeVore, R. A., and Hochmuth, R. 2000. Restricted nonlinear approximation. Construct. Approx. 16, 1, 85--113.Google ScholarGoogle ScholarCross RefCross Ref
  8. Coifman, R. and Wickerhauser, M. 1992. Entropy-Based algorithms for best basis selection. IEEE Trans. Inf. Theory 38, 2, 713--718.Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Constantinides, G., Cheung, P., and Luk, W. 2003. Synthesis of saturation arithmetic architectures. ACM Trans. Des. Autom. Electron. Syst. 8, 3, 334--354. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. d’Alberto, P., Milder, P., Sandryhaila, A., Franchetti, F., Hoe, J., Moura, J., Püschel, M., and Johnson, J. 2007. Generating FPGA accelerated DFT libraries. In Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM’07). IEEE Computer Society Press, Los Alamitos, CA, 173--184. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. deLorimier, M. and DeHon, A. 2005. Floating-Point sparse matrix-vector multiply for FPGAs. In Proceedings of the ACM/SIGDA 13th International Symposium on Field-Programmable Gate Arrays (FPGA’05). ACM Press, New York. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. DeVore, R. A., Jawerth, B., and Popov, V. 1992. Compression of wavelet decompositions. Amer. J. Math. 114, 737--785.Google ScholarGoogle ScholarCross RefCross Ref
  13. Donoho, D. and Elad, M. 2003. Optimal sparse representations in general (nonorthogonal) dictionaries via ℓ<sup>1</sup> minimization. Proc. Nat. Acad. Sci. 100, 2197--2202.Google ScholarGoogle ScholarCross RefCross Ref
  14. Dou, Y., Vassiliadis, S., Kuzmanov, G., and Gaydadjiev, G. 2005. 64-bit floating-point FPGA matrix multiplication. In Proceedings of the ACM/SIGDA 13th International Symposium on Field-Programmable Gate Arrays (FPGA’05). ACM Press, New York, 86--95. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Ellis, D. 2003. Recoded digits archive at Columbia University.Google ScholarGoogle Scholar
  16. Fahmy, S., Cheung, P., and Luk, W. 2005. Novel FPGA-based implementation of median and weighted median filters for image processing. In Proceedings of the International Conference on Field Programmable Logic and Applications (FPL’05). 142--147.Google ScholarGoogle Scholar
  17. Frigo, M. 1999. A fast fourier transform compiler. In Proceedings of the ACM Conference on Programming Language Design and Implementation (PLDI’99). ACM Press, New York. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Mallat, S. and Zhang, Z. 1993. Matching pursuits with time-frequency dictionaries. IEEE Trans. Signal Process. 41, 12, 3397--3415.Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Marcus, G. and Nolazco-FIores, J. A. 2005. An FPGA-based coprocessor for the SPHINX speech recognition system: Early experiences. In Proceedings of the International Conference on Reconfigurable Computing and FPGAs (ReConFig’05). IEEE Computer Society, 27. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Mathew, B., Davis, A., and Fang, Z. 2003. A low-power accelerator for the SPHINX 3 speech recognition system. In Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES’03). Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Melnikoff, S., Quigley, S., and Russell, M. 2002. Implementing a simple continuous speech recognition system on an FPGA. In Proceedings of the 10th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM’02). Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Milder, P., Franchetti, F., Hoe, J., and Pschel, M. 2008. Formal datapath representation and manipulation for implementing DSP transforms. In Proceedings Design Automation Conference (DAC). 385--390. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Ortigosa, E., Ortigosa, P., Canas, A., Ros, E., Agis, R., and Ortega, J. 2003. FPGA implementation of multi-layer perceptrons for speech recognition. In Proceedings of the 13th International Conference on Field Programmable Logic and Applications (FPL’03). Lecture Notes in Computer Science, vol. 2778. Springer.Google ScholarGoogle Scholar
  24. Park, I.-C. and Kang, H.-J. 2001. Digital filter synthesis based on minimal signed digit representation. In Proceedings of the 38th Conference on Design Automation (DAC’01). ACM Press, New York, 468--473. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Pati, Y. C., Rezaiifar, R., and Krishnaprasad, P. S. 1993. Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. In Proceedings of the 27th Annual Asilomar Conference on Signals, Systems, and Computers. 40--44.Google ScholarGoogle Scholar
  26. Proakis, J. G. and Manolakis, D. K. 1996. Digital Signal Processing: Principles, Algorithms and Applications 3rd Ed. Prentice-Hall, NJ. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Püschel, M., Moura, J., Johnson, J., Padua, D., Veloso, M., Singer, B., Xiong, J., Franchetti, F., Gacic, A., Voronenko, Y., Chen, K., Johnson, R., and Rizzolo, N. 2005. SPIRAL: Code generation for DSP transforms. In Proceedings of the IEEE (Special Issue on Program Generation, Optimization, and Adaptation) 93, 2, 232--275.Google ScholarGoogle ScholarCross RefCross Ref
  28. Saito, N. and Coifman, R. 1994. Local discriminant bases. In SPIE Math. Imag. Wavelet Appl. Signal Image Process. (SPIE). vol. 2303.Google ScholarGoogle Scholar
  29. Temam, O. and Jalby, W. 1992. Characterizing the behavior of sparse algorithms on caches. In Proceedings of Supercomputing (SC’92). IEEE Computer Society Press, 578--587. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Thiele, C. M. and Villemoes, L. F. 1996. A fast algorithm for adapted time frequency tilings. Appl. Comput. Harmon. Anal. 3, 91--99.Google ScholarGoogle ScholarCross RefCross Ref
  31. Tropp, J. A. 2003. Greed is good: Algorithmic results for sparse approximation. Tech. rep. 2003--04, Texas Institute for Computational and Applied Mathematics.Google ScholarGoogle Scholar
  32. Underwood, K. 2004. FPGAs vs. CPUs: Trends in peak floating-point performance. In Proceedings of the 12th ACM International Symposium on Field Programmable Gate Arrays (FPGA’04). ACM Press, New York, 171--180. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Vuduc, R. and Moon, H.-J. 2005. Fast sparse matrix-vector multiplication by exploiting variable block structure. In Proceedings of the International Conference on High Performance Computing and Communcations (HPCC). Lecture Notes in Computer Science, vol. 3726. Springer, 807--816. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Woods, R., McAllister, J., and Lightbody, G. 2008. FPGA-Based Implementation of Signal Processing Systems. John Wiley & Sons. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Xilinx Corp. 2007. Virtex-II Pro<sup>TM</sup> platform FPGAs: Complete data sheet.Google ScholarGoogle Scholar
  36. Xilinx Corp. 2008. Virtex-3<sup>TM</sup> platform FPGAs: Complete data sheet.Google ScholarGoogle Scholar
  37. Xilinx Corp. 2009. Virtex-5<sup>TM</sup> platform FPGAs: Complete data sheet.Google ScholarGoogle Scholar
  38. Zhuo, L. and Prasanna, V. K. 2005. Sparse matrix-vector multiplication on FPGAs. In Proceedings of the ACM/SIGDA 13th International Symposium on Field-Programmable Gate Arrays (FPGA’05). ACM Press, New York, 63--74. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Domain-Specific Optimization of Signal Recognition Targeting FPGAs

            Recommendations

            Reviews

            Vivek Venugopal

            Demertzi et al. present a data optimization algorithm to identify the reuse and scheduling of computation blocks in signal processing applications. The authors demonstrate their algorithm implementation using Xilinx Spartan-3 (xc3s1500), Virtex-II Pro (xc2vp30), and Virtex-5 (xc5vlx110) devices. The paper describes a signal recognition and processing application with modified mathematical, software, and hardware optimizations for implementing it. It uses mostly fast Fourier transform (FFT) algorithms for signal recognition, which require floating-point arithmetic and significant usage of memory processor bandwidth, power, and energy. The authors use the Walsh-Hadamard transform combined with the BestBasis algorithm to reduce the number of arithmetic operations. They use a common subexpression elimination step to reduce the unnecessary data accesses and the memory footprint. The computation reuse algorithm is a greedy algorithm, and starts with longer patterns in order to reuse large computation blocks. The authors use a training set of ten 4096 x 4096 matrices for their experiments, and evaluate their algorithm on the basis of computation reduction, impact of using variations in the reuse algorithm, impact of parallelization, impact on storage requirements, and area/speedup comparison. The authors also compare the field-programmable gate array (FPGA) implementations with a software implementation targeted on an Intel Core quad core central processing unit (CPU), clocked at 2.33 GHz, and find the FPGA implementation to be 10.25 times faster. This paper is a good starting point for researchers who want to implement signal recognition processing algorithms that use less computational resources. Online Computing Reviews Service

            Access critical reviews of Computing literature here

            Become a reviewer for Computing Reviews.

            Comments

            Login options

            Check if you have access through your login credentials or your institution to get full access on this article.

            Sign in

            Full Access

            • Published in

              cover image ACM Transactions on Reconfigurable Technology and Systems
              ACM Transactions on Reconfigurable Technology and Systems  Volume 4, Issue 2
              May 2011
              216 pages
              ISSN:1936-7406
              EISSN:1936-7414
              DOI:10.1145/1968502
              Issue’s Table of Contents

              Copyright © 2011 ACM

              Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Publication History

              • Published: 1 May 2011
              • Accepted: 1 December 2009
              • Revised: 1 September 2009
              • Received: 1 February 2009
              Published in trets Volume 4, Issue 2

              Permissions

              Request permissions about this article.

              Request Permissions

              Check for updates

              Qualifiers

              • research-article
              • Research
              • Refereed
            • Article Metrics

              • Downloads (Last 12 months)5
              • Downloads (Last 6 weeks)1

              Other Metrics

            PDF Format

            View or Download as a PDF file.

            PDF

            eReader

            View online with eReader.

            eReader