ABSTRACT
The inverse truncated Fourier transform (ITFT) is a key component in the fast polynomial and large integer algorithms introduced by van der Hoeven. This paper reports a high performance implementation of the ITFT which poses additional challenges compared to that of the forward transform. A general-radix variant of the ITFT algorithm is developed to allow the implementation to automatically adapt to the memory hierarchy. Then a parallel ITFT algorithm is developed that trades off small arithmetic cost for full vectorization and improved multi-threaded parallelism. The algorithms are automatically generated and tuned to produce an arbitrary-size ITFT library. The new algorithms and the implementation smooths out the staircase performance associated with power-of-two modular FFT implementations, and provide significant performance improvement over zero-padding approaches even when high-performance FFT libraries are used.
- A. Arnold. A new truncated fourier transform algorithm. In Proceedings of the 38th International Symposium on International Symposium on Symbolic and Algebraic Computation, ISSAC '13, pages 15--22, New York, NY, USA, 2013. ACM. Google ScholarDigital Library
- D. H. Bailey. FFTs in external or hierarchical memory. J. Supercomput., 4(1):23--35, Mar. 1990. Google ScholarDigital Library
- L. I. Bluestein. A linear filtering approach to the computation of discrete Fourier transform. Audio and Electroacoustics, IEEE Transactions on, 18(4):451--455, 1970.Google Scholar
- J. Cooley and J. Tukey. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 19(90):297--301, 1965.Google ScholarCross Ref
- F. Franchetti and M. Püschel. Generating high-performance pruned FFT implementations. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 549--552, 2009. Google ScholarDigital Library
- F. Franchetti, Y. Voronenko, and M. Püschel. Formal loop merging for signal transforms. In Programming Languages Design and Implementation (PLDI), pages 315--326, 2005. Google ScholarDigital Library
- F. Franchetti, Y. Voronenko, and M. Püschel. FFT program generation for shared memory: SMP and multicore. In Supercomputing (SC), 2006. Google ScholarDigital Library
- F. Franchetti, Y. Voronenko, and M. Püschel. A rewriting system for the vectorization of signal transforms. In High Performance Computing for Computational Science (VECPAR), volume 4395 of Lecture Notes in Computer Science, pages 363--377. Springer, 2006. Google ScholarDigital Library
- M. Frigo and S. Johnson. The design and implementation of FFTW3. Proceedings of the IEEE, 93(2):216--231, 2005.Google ScholarCross Ref
- J. V. Z. Gathen and J. Gerhard. Modern Computer Algebra. Cambridge University Press, New York, NY, USA, 2 edition, 2003. Google ScholarDigital Library
- I. J. Good. The interaction algorithm and practical Fourier analysis. Journal of the Royal Statistical Society. Series B (Methodological), 20(2): pp. 361--372, 1958.Google ScholarCross Ref
- D. Harvey. A cache-friendly truncated FFT. Theor. Comput. Sci., 410(27-29):2649--2658, June 2009. Google ScholarDigital Library
- D. Harvey and D. S. Roche. An in-place truncated fourier transform and applications to polynomial multiplication. In Proceedings of the 2010 International Symposium on Symbolic and Algebraic Computation, ISSAC '10, pages 325--329, New York, NY, USA, 2010. ACM. Google ScholarDigital Library
- T. Mateer. Fast Fourier transform algorithms with applications. PhD thesis, Citeseer, 2008. Google ScholarDigital Library
- X. Li, M. Moreno Maza, R. Rasheed, and E. Schost. The modpn library: Bringing fast polynomial arithmetic into Maple. J. Symb. Comput., 46(7):841--858, July 2011. Google ScholarDigital Library
- M. M. Maza and Y. Xie. FFT-based dense polynomial arithmetic on multi-cores. In HPCS, pages 378--399, 2009. Google ScholarDigital Library
- L. Meng and J. Johnson. Automatic parallel library generation for general-size modular FFT algorithms. In V. Gerdt, W. Koepf, E. Mayr, and E. Vorozhtsov, editors, Computer Algebra in Scientific Computing, volume 8136 of Lecture Notes in Computer Science, pages 243--256. Springer International Publishing, 2013. Google ScholarDigital Library
- L. Meng and J. Johnson. High performance implementation of the TFT. In Proceedings of the 39th International Symposium on Symbolic and Algebraic Computation, ISSAC '14, pages 328--334, New York, NY, USA, 2014. ACM. Google ScholarDigital Library
- L. Meng, Y. Voronenko, J. R. Johnson, M. Moreno Maza, F. Franchetti, and Y. Xie. Spiral-generated modular FFT algorithms. In Proceedings of the 4th International Workshop on Parallel and Symbolic Computation, PASCO '10, pages 169--170, New York, NY, USA, 2010. ACM. Google ScholarDigital Library
- P. L. Montgomery. Modular multiplication without trial division. Mathematics of Computation, 44(170):519--521, 1985.Google ScholarCross Ref
- M. Püschel, J. M. F. Moura, J. Johnson, D. Padua, M. Veloso, B. Singer, J. Xiong, F. Franchetti, A. Gacic, Y. Voronenko, K. Chen, R. W. Johnson, and N. Rizzolo. SPIRAL: Code generation for DSP transforms. Proceedings of the IEEE, special issue on "Program Generation, Optimization, and Adaptation", 93(2):232--275, 2005.Google ScholarCross Ref
- C. Rader. Discrete Fourier transforms when the number of data samples is prime. Proceedings of the IEEE, 56(6):1107--1108, 1968.Google ScholarCross Ref
- J. van der Hoeven. The truncated Fourier transform and applications. In J. Gutierrez, editor, Proc. ISSAC 2004, pages 290--296, Univ. of Cantabria, Santander, Spain, July 4--7 2004. Google ScholarDigital Library
- Y. Voronenko. Library generation for linear transforms. PhD thesis, Electrical and Computer Engineering, Carnegie Mellon University, 2008. Google ScholarDigital Library
- J. Xiong, J. Johnson, R. W. Johnson, and D. Padua. SPL: A language and compiler for DSP algorithms. In Programming Languages Design and Implementation (PLDI), pages 298--308, 2001. Google ScholarDigital Library
Index Terms
- High performance implementation of the inverse TFT
Recommendations
High performance implementation of the TFT
ISSAC '14: Proceedings of the 39th International Symposium on Symbolic and Algebraic ComputationThis paper reports on a high-performance implementation of the truncated Fourier transform (TFT). A general Cooley-Tukey like algorithm for the TFT is developed that allows the implementation to automatically adapt to the memory hierarchy. Then the ...
Implementation and Performance of the Fast Hartley Transform
The authors investigated the implementation aspects of the fast Hartley transform (FHT) in both software and hardware. They describe the modifications required to convert existing fast Fourier transform (FFT) programs to execute FHTs, showing the ease ...
New systolic array implementation of the 2-D discrete cosine transform and its inverse
A new systolic array without matrix transposition hardware is proposed to compute the two-dimensional discrete cosine transform (2-D DCT) based on the row-column decomposition. This architecture uses N2 multipliers to evaluate N×N-point DCTs at a ...
Comments