research-article

Domain-Specific Optimization of Signal Recognition Targeting FPGAs

Authors:
Melina Demertzi

University of Southern California

University of Southern California
View Profile

,
Pedro C. Diniz

INESC-ID, Lisboa

INESC-ID, Lisboa
View Profile

,
Mary W. Hall

University of Utah

University of Utah
View Profile

,
Anna C. Gilbert

University of Michigan

University of Michigan
View Profile

,
Yi Wang

University of Michigan

University of Michigan
View Profile

ACM Transactions on Reconfigurable Technology and Systems Volume 4 Issue 2Article No.: 17pp 1–26https://doi.org/10.1145/1968502.1968508

Published:01 May 2011Publication History

ACM Transactions on Reconfigurable Technology and Systems

Abstract

Domain-specific optimizations on matrix computations exploiting specific arithmetic and matrix representation formats have achieved significant performance/area gains in Field-Programmable Gate Array (FPGA) hardware designs. In this article, we explore the application of data-driven optimizations to reduce both storage and computation requirements to the problem of signal recognition from a known dictionary. By starting with a high-level mathematical representation of a signal recognition problem, we perform optimizations across the layers of the system, exploiting mathematical structure to improve implementation efficiency. Specifically, we use Walsh wavelet packets in conjunction with a BestBasis algorithm to distinguish between spoken digits. The resulting transform matrices are quite sparse, and exhibit a rich algebraic structure that contains significant overlap across rows. As a consequence, dot-product computations of the transform matrix and signal vectors exhibit significant computation reuse, or repeated identical computations. We present an algorithm for identifying this computation reuse and scheduling of the row computations. We exploit this reuse to derive FPGA hardware implementations that reduce the amount of computation for an individual matrix by as much as 6.35× and an average of 2× for a single dot-product unit. The implementation that exploits reuse achieves a 2× computation reduction compared to three concurrently-executing simpler accumulator units with the same aggregate design area and outperforms software implementations on high-end desktop personal computers.

References

Aho, A., Lam, M., Sethi, R., and Ullman, J. 2006. Compilers: Principles, Techniques and Toos 2nd Ed. Addison-Wesley, Reading, MA. Google ScholarDigital Library
Bai, Z., Demmel, J., Dongarra, J., Ruhe, A., and van der Vorst, H., Eds. 2000. Templates for the Soution of Algebraic Eigenvalue Problems: A Practical Guide. Society for Industrial and Applied Mathematics, Philadelphia, PA. Google ScholarDigital Library
Beauchamp, M., Hauck, S., Underwood, K., and Hemmert, S. 2006. Embedded floating-point units in FPGAs. In Proceedings of the ACM/SIGDA 14th International Symposium on Field Programmable Gate Arrays (FPGA’06). ACM Press, New York, 12--20. Google ScholarDigital Library
Böhm, W., Draper, B., Najjar, W., Hammes, J., Rinker, R., Chawathe, M., and Ross, C. 2001. One-Step compilation of image processing applications to FPGAs. In Proceedings of the the 9th IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM’01). IEEE Computer Society Press, Los Alamitos, CA, 209--218. Google ScholarDigital Library
Bouganis, C.-S., Park, S.-B., Constantinides, G., and Cheung, P. 2009. Synthesis and optimization of 2D filter designs for heterogeneous FPGAs. ACM Trans. Reconfig. Technol. Syst. 1, 4, 1--28. Google ScholarDigital Library
Callanan, O., Gregg, D., Nisbet, A., and Peardon, M. 2006. High performance scientific computing using FPGAs with IEEE floating point and logarithmic arithmetic for lattice QCD. In Proceedings of the International Conference on Field Programmable Logic and Applications (FPL’06). 1--6.Google Scholar
Cohen, A., DeVore, R. A., and Hochmuth, R. 2000. Restricted nonlinear approximation. Construct. Approx. 16, 1, 85--113.Google ScholarCross Ref
Coifman, R. and Wickerhauser, M. 1992. Entropy-Based algorithms for best basis selection. IEEE Trans. Inf. Theory 38, 2, 713--718.Google ScholarDigital Library
Constantinides, G., Cheung, P., and Luk, W. 2003. Synthesis of saturation arithmetic architectures. ACM Trans. Des. Autom. Electron. Syst. 8, 3, 334--354. Google ScholarDigital Library
d’Alberto, P., Milder, P., Sandryhaila, A., Franchetti, F., Hoe, J., Moura, J., Püschel, M., and Johnson, J. 2007. Generating FPGA accelerated DFT libraries. In Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM’07). IEEE Computer Society Press, Los Alamitos, CA, 173--184. Google ScholarDigital Library
deLorimier, M. and DeHon, A. 2005. Floating-Point sparse matrix-vector multiply for FPGAs. In Proceedings of the ACM/SIGDA 13th International Symposium on Field-Programmable Gate Arrays (FPGA’05). ACM Press, New York. Google ScholarDigital Library
DeVore, R. A., Jawerth, B., and Popov, V. 1992. Compression of wavelet decompositions. Amer. J. Math. 114, 737--785.Google ScholarCross Ref
Donoho, D. and Elad, M. 2003. Optimal sparse representations in general (nonorthogonal) dictionaries via &ell;1 minimization. Proc. Nat. Acad. Sci. 100, 2197--2202.Google ScholarCross Ref
Dou, Y., Vassiliadis, S., Kuzmanov, G., and Gaydadjiev, G. 2005. 64-bit floating-point FPGA matrix multiplication. In Proceedings of the ACM/SIGDA 13th International Symposium on Field-Programmable Gate Arrays (FPGA’05). ACM Press, New York, 86--95. Google ScholarDigital Library
Ellis, D. 2003. Recoded digits archive at Columbia University.Google Scholar
Fahmy, S., Cheung, P., and Luk, W. 2005. Novel FPGA-based implementation of median and weighted median filters for image processing. In Proceedings of the International Conference on Field Programmable Logic and Applications (FPL’05). 142--147.Google Scholar
Frigo, M. 1999. A fast fourier transform compiler. In Proceedings of the ACM Conference on Programming Language Design and Implementation (PLDI’99). ACM Press, New York. Google ScholarDigital Library
Mallat, S. and Zhang, Z. 1993. Matching pursuits with time-frequency dictionaries. IEEE Trans. Signal Process. 41, 12, 3397--3415.Google ScholarDigital Library
Marcus, G. and Nolazco-FIores, J. A. 2005. An FPGA-based coprocessor for the SPHINX speech recognition system: Early experiences. In Proceedings of the International Conference on Reconfigurable Computing and FPGAs (ReConFig’05). IEEE Computer Society, 27. Google ScholarDigital Library
Mathew, B., Davis, A., and Fang, Z. 2003. A low-power accelerator for the SPHINX 3 speech recognition system. In Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES’03). Google ScholarDigital Library
Melnikoff, S., Quigley, S., and Russell, M. 2002. Implementing a simple continuous speech recognition system on an FPGA. In Proceedings of the 10th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM’02). Google ScholarDigital Library
Milder, P., Franchetti, F., Hoe, J., and Pschel, M. 2008. Formal datapath representation and manipulation for implementing DSP transforms. In Proceedings Design Automation Conference (DAC). 385--390. Google ScholarDigital Library
Ortigosa, E., Ortigosa, P., Canas, A., Ros, E., Agis, R., and Ortega, J. 2003. FPGA implementation of multi-layer perceptrons for speech recognition. In Proceedings of the 13th International Conference on Field Programmable Logic and Applications (FPL’03). Lecture Notes in Computer Science, vol. 2778. Springer.Google Scholar
Park, I.-C. and Kang, H.-J. 2001. Digital filter synthesis based on minimal signed digit representation. In Proceedings of the 38th Conference on Design Automation (DAC’01). ACM Press, New York, 468--473. Google ScholarDigital Library
Pati, Y. C., Rezaiifar, R., and Krishnaprasad, P. S. 1993. Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition. In Proceedings of the 27th Annual Asilomar Conference on Signals, Systems, and Computers. 40--44.Google Scholar
Proakis, J. G. and Manolakis, D. K. 1996. Digital Signal Processing: Principles, Algorithms and Applications 3rd Ed. Prentice-Hall, NJ. Google ScholarDigital Library
Püschel, M., Moura, J., Johnson, J., Padua, D., Veloso, M., Singer, B., Xiong, J., Franchetti, F., Gacic, A., Voronenko, Y., Chen, K., Johnson, R., and Rizzolo, N. 2005. SPIRAL: Code generation for DSP transforms. In Proceedings of the IEEE (Special Issue on Program Generation, Optimization, and Adaptation) 93, 2, 232--275.Google ScholarCross Ref
Saito, N. and Coifman, R. 1994. Local discriminant bases. In SPIE Math. Imag. Wavelet Appl. Signal Image Process. (SPIE). vol. 2303.Google Scholar
Temam, O. and Jalby, W. 1992. Characterizing the behavior of sparse algorithms on caches. In Proceedings of Supercomputing (SC’92). IEEE Computer Society Press, 578--587. Google ScholarDigital Library
Thiele, C. M. and Villemoes, L. F. 1996. A fast algorithm for adapted time frequency tilings. Appl. Comput. Harmon. Anal. 3, 91--99.Google ScholarCross Ref
Tropp, J. A. 2003. Greed is good: Algorithmic results for sparse approximation. Tech. rep. 2003--04, Texas Institute for Computational and Applied Mathematics.Google Scholar
Underwood, K. 2004. FPGAs vs. CPUs: Trends in peak floating-point performance. In Proceedings of the 12th ACM International Symposium on Field Programmable Gate Arrays (FPGA’04). ACM Press, New York, 171--180. Google ScholarDigital Library
Vuduc, R. and Moon, H.-J. 2005. Fast sparse matrix-vector multiplication by exploiting variable block structure. In Proceedings of the International Conference on High Performance Computing and Communcations (HPCC). Lecture Notes in Computer Science, vol. 3726. Springer, 807--816. Google ScholarDigital Library
Woods, R., McAllister, J., and Lightbody, G. 2008. FPGA-Based Implementation of Signal Processing Systems. John Wiley & Sons. Google ScholarDigital Library
Xilinx Corp. 2007. Virtex-II ProTM platform FPGAs: Complete data sheet.Google Scholar
Xilinx Corp. 2008. Virtex-3TM platform FPGAs: Complete data sheet.Google Scholar
Xilinx Corp. 2009. Virtex-5TM platform FPGAs: Complete data sheet.Google Scholar
Zhuo, L. and Prasanna, V. K. 2005. Sparse matrix-vector multiplication on FPGAs. In Proceedings of the ACM/SIGDA 13th International Symposium on Field-Programmable Gate Arrays (FPGA’05). ACM Press, New York, 63--74. Google ScholarDigital Library

Index Terms

Domain-Specific Optimization of Signal Recognition Targeting FPGAs

Recommendations

Computation reuse in domain-specific optimization of signal recognition
FPGA '09: Proceedings of the ACM/SIGDA international symposium on Field programmable gate arrays

Domain-specific optimizations that exploit specific arithmetic and representation formats have been shown to achieve significant performance/area gains in FPGA hardware designs. In this work, we describe an approach to domain-specific optimization that ...
Read More
In-Package Domain-Specific ASICs for Intel® Stratix® 10 FPGAs: A Case Study of Accelerating Deep Learning Using TensorTile ASIC(Abstract Only)
FPGA '18: Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays

FPGAs or ASICs? There is a long-running debate on this. FPGAs are extremely flexible while ASICs offer top efficiency but inflexible. We believe that FPGAs and ASICs are better together, to offer both flexible and efficient solutions. We propose single-...
Read More
Domain-Specific Language for HW/SW Co-design for FPGAs
DSL '09: Proceedings of the IFIP TC 2 Working Conference on Domain-Specific Languages

This article describes FSMLanguage, a domain-specific language for HW/SW co-design targeting platform FPGAs. Modern platform FPGAs provide a wealth of configurable logic in addition to embedded processors, distributed RAM blocks, and DSP slices in order ...
Read More

Reviews

Reviewer: Vivek Venugopal

Demertzi et al. present a data optimization algorithm to identify the reuse and scheduling of computation blocks in signal processing applications. The authors demonstrate their algorithm implementation using Xilinx Spartan-3 (xc3s1500), Virtex-II Pro (xc2vp30), and Virtex-5 (xc5vlx110) devices. The paper describes a signal recognition and processing application with modified mathematical, software, and hardware optimizations for implementing it. It uses mostly fast Fourier transform (FFT) algorithms for signal recognition, which require floating-point arithmetic and significant usage of memory processor bandwidth, power, and energy. The authors use the Walsh-Hadamard transform combined with the BestBasis algorithm to reduce the number of arithmetic operations. They use a common subexpression elimination step to reduce the unnecessary data accesses and the memory footprint. The computation reuse algorithm is a greedy algorithm, and starts with longer patterns in order to reuse large computation blocks. The authors use a training set of ten 4096 x 4096 matrices for their experiments, and evaluate their algorithm on the basis of computation reduction, impact of using variations in the reuse algorithm, impact of parallelization, impact on storage requirements, and area/speedup comparison. The authors also compare the field-programmable gate array (FPGA) implementations with a software implementation targeted on an Intel Core quad core central processing unit (CPU), clocked at 2.33 GHz, and find the FPGA implementation to be 10.25 times faster. This paper is a good starting point for researchers who want to implement signal recognition processing algorithms that use less computational resources. Online Computing Reviews Service

Access critical reviews of Computing literature here

Become a reviewer for Computing Reviews.

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

ACM Transactions on Reconfigurable Technology and Systems Volume 4, Issue 2
May 2011
216 pages
ISSN:1936-7406
EISSN:1936-7414
DOI:10.1145/1968502
Issue’s Table of Contents

Copyright © 2011 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 May 2011
- Accepted: 1 December 2009
- Revised: 1 September 2009
- Received: 1 February 2009
Published in trets Volume 4, Issue 2

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Configurable architectures
FPGA
computation reuse
signal recognition
Qualifiers
- research-article
- Research
- Refereed
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 0
  Total Citations
  View Citations
- 273
  Total Downloads
- Downloads (Last 12 months)5
- Downloads (Last 6 weeks)1
Other Metrics
View Author Metrics
Cited By
This publication has not been cited yet

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Domain-Specific Optimization of Signal Recognition Targeting FPGAs

ACM Transactions on Reconfigurable Technology and Systems

Abstract

References

Cited By

Index Terms

Recommendations

Computation reuse in domain-specific optimization of signal recognition

In-Package Domain-Specific ASICs for Intel® Stratix® 10 FPGAs: A Case Study of Accelerating Deep Learning Using TensorTile ASIC(Abstract Only)

Domain-Specific Language for HW/SW Co-design for FPGAs

Reviews

Access critical reviews of Computing literature here

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Domain-Specific Optimization of Signal Recognition Targeting FPGAs

ACM Transactions on Reconfigurable Technology and Systems

Abstract

References

Cited By

Index Terms

Recommendations

Computation reuse in domain-specific optimization of signal recognition

In-Package Domain-Specific ASICs for Intel® Stratix® 10 FPGAs: A Case Study of Accelerating Deep Learning Using TensorTile ASIC(Abstract Only)

Domain-Specific Language for HW/SW Co-design for FPGAs

Reviews

Access critical reviews of Computing literature here

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media