ABSTRACT
We propose a novel FFT datapath that reduces the memory requirement compared to state-of-the-art RAM-based implementations by up to a factor of two. The novelty is in a technique to fuse the datapaths for the required perfect shuffle and bit reversal and is applicable to an entire design space of FFT implementations with varying degrees of reuse and number of input ports. We implemented a tool to generate this FFT design space for a given input size and to benchmark against prior work. The results show a reduction of half the RAM banks and/or half the logic complexity used for the permutations. The technique for fusing permutations is more generally applicable beyond the FFT.
- D. Cohen, “Simplified control of FFT hardware,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, no. 6, pp. 577--579, 1976.Google ScholarCross Ref
- P. Kumhom, J. R. Johnson, and P. Nagvajara, “Design, optimization, and implementation of a universal FFT processor,” in Proc. International ASIC/SOC Conference (ASIC), pp. 182--186, 2000.Google ScholarCross Ref
- A. Cortés, I. Vélez, and J. F. Sevillano, “Radix $r^k $FFTs: Matricial representation and SDC/SDF pipeline implementation,” IEEE Transactions on Signal Processing, vol. 57, no. 7, pp. 2824--2839, 2009. Google ScholarDigital Library
- S. He and M. Torkelson, “A new approach to pipeline FFT processor,” in Proc. Parallel Processing Symposium (IPPS), pp. 766--770, 1996. Google ScholarDigital Library
- B. Akin, F. Franchetti, and J. C. Hoe, “FFTs with near-optimal memory access through block data layouts: Algorithm, architecture and design automation,” Journal of Signal Processing Systems, vol. 85, no. 1, pp. 67--82, 2015. Google ScholarDigital Library
- M. C. Pease, “An adaptation of the fast Fourier transform for parallel processing,” Journal of the ACM, vol. 15, no. 2, pp. 252--264, 1968. Google ScholarDigital Library
- J. H. Takala, T. S. Jarvinen, and H. T. Sorokin, “Conflict-free parallel memory access scheme for FFT processors,” in Proc. International Symposium on Circuits and Systems (ISCAS), vol. 4, pp. 524--527, 2003.Google ScholarCross Ref
- K. J. Page, J. F. Arrigo, and P. M. Chau, “Reconfigurable-hardware-based digital signal processing for wireless communications,” in Advanced Signal Processing Algorithms, Architectures and Implementations (P. SPIE, ed.), vol. 3162, pp. 529--540, 1997.Google Scholar
- G. Nordin, P. A. Milder, J. C. Hoe, and M. Püschel, “Automatic generation of customized discrete Fourier transform IPs,” in Proc. Design Automation Conference (DAC), pp. 471--474, 2005. Google ScholarDigital Library
- P. A. Milder, F. Franchetti, J. C. Hoe, and M. Püschel, “Computer generation of hardware for linear digital signal processing transforms,” Transactions on Design Automation of Electronic Systems (TODAES), vol. 17, no. 2, pp. 15:1--15:33, 2012. Google ScholarDigital Library
- P. A. Milder, J. C. Hoe, and M. Püschel, “Automatic generation of streaming datapaths for arbitrary fixed permutations,” in Proc. Design, Automation and Test in Europe (DATE), pp. 1118--1123, 2009. Google ScholarDigital Library
- T. Koehn and P. Athanas, “Arbitrary streaming permutations with minimum memory and latency,” in Proc. International Conference on Computer-Aided Design (ICCAD), pp. 1--6, 2016. Google ScholarDigital Library
- K. K. Parhi, “Systematic synthesis of DSP data format converters using life-time analysis and forward-backward register allocation,” IEEE Transactions on Circuits and Systems II (TCAS-II), vol. 39, no. 7, pp. 423--440, 1992.Google ScholarCross Ref
- K. J. Page and P. M. Chau, “Folding large regular computational graphs onto smaller processor arrays,” in Advanced Signal Processing Algorithms, Architectures and Implementations (P. SPIE, ed.), vol. 2846, pp. 383--394, 1996.Google Scholar
- T. J"arvinen, P. Salmela, H. Sorokin, and J. Takala, “Stride permutation networks for array processors,” in Proc. International Conference on Application-Specific Systems, Architectures and Processors Proceedings (ASAP), pp. 376--386, 2004. Google ScholarDigital Library
- M. Püschel, P. A. Milder, and J. C. Hoe, “Permuting streaming data using RAMs,” Journal of the ACM, vol. 56, no. 2, pp. 10:1--10:34, 2009. Google ScholarDigital Library
- F. Serre, T. Holenstein, and M. Püschel, “Optimal circuits for streamed linear permutations using RAM,” in Proc. International Symposium on Field-Programmable Gate Arrays (FPGA), pp. 215--223, 2016. Google ScholarDigital Library
- F. Serre, “DFT and streamed linear permutation generator for hardware.” https://acl.inf.ethz.ch/research/hardware/, 2018.Google Scholar
- M. C. Pease, “The indirect binary n-cube microprocessor array,” IEEE Transactions on Computers, vol. 26, no. 5, pp. 458--473, 1977. Google ScholarDigital Library
- J. Lenfant and S. Tahé, “Permuting data with the Omega network,” Acta Informatica, vol. 21, no. 6, pp. 629--641, 1985.Google ScholarCross Ref
- F. Serre and M. P" uschel, “Generalizing block LU factorization: A lower-upper-lower block triangular decomposition with minimal off-diagonal ranks,” Linear Algebra and its Applications, vol. 509, pp. 114--142, 2016.Google ScholarCross Ref
- F. de Dinechin and B. Pasca, "Designing custom arithmetic data paths with FloPoCo," IEEE Design & Test of Computers, vol. 28, no. 4, pp. 18--27, 2011. Google ScholarDigital Library
- P. A. Milder, “Spiral DFT/FFT IP core generator.” http://www.spiral.net/hardware/dftgen.html, 2008.Google Scholar
- A. Waksman, “A permutation network,” Journal of the ACM, vol. 15, no. 1, pp. 159--163, 1968. Google ScholarDigital Library
- B. G. Jo and M. H. Sunwoo, “New continuous-flow mixed-radix (CFMR) FFT processor using novel in-place strategy,” IEEE Transactions on Circuits and Systems I (TCAS-I), vol. 52, no. 5, pp. 911--919, 2005.Google ScholarCross Ref
- M. Garrido, M. Ángel Sánchez, M. L. López-Vallejo, and J. Grajal, “A 4096-point radix-4 memory-based FFT using DSP slices,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 25, no. 1, pp. 375--379, 2017. Google ScholarDigital Library
Index Terms
- Memory-Efficient Fast Fourier Transform on Streaming Data by Fusing Permutations
Recommendations
Optimal Circuits for Streamed Linear Permutations Using RAM
FPGA '16: Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate ArraysWe propose a method to automatically derive hardware structures that perform a fixed linear permutation on streaming data. Linear permutations are permutations that map linearly the bit representation of the elements addresses. This set contains many of ...
Permuting streaming data using RAMs
This article presents a method for constructing hardware structures that perform a fixed permutation on streaming data. The method applies to permutations that can be represented as linear mappings on the bit-level representation of the data locations. ...
Algorithm 338: algol procedures for the fast Fourier transform
The following procedures are based on the Cooley-Tukey algorithm [1] for computing the finite Fourier transform of a complex data vector; the dimension of the data vector is assumed here to be a power of two. Procedure COMPLEXTRANSFORM computes either ...
Comments