research-article

High performance implementation of the inverse TFT

Authors:
Lingchuan Meng

Drexel University, Philadelphia, Pennsylvania

Drexel University, Philadelphia, Pennsylvania
View Profile

,
Jeremy Johnson

Drexel University, Philadelphia, Pennsylvania

Drexel University, Philadelphia, Pennsylvania
View Profile

PASCO '15: Proceedings of the 2015 International Workshop on Parallel Symbolic ComputationJuly 2015Pages 87–94https://doi.org/10.1145/2790282.2790292

Published:10 July 2015Publication History

PASCO '15: Proceedings of the 2015 International Workshop on Parallel Symbolic Computation

Pages 87–94

ABSTRACT

The inverse truncated Fourier transform (ITFT) is a key component in the fast polynomial and large integer algorithms introduced by van der Hoeven. This paper reports a high performance implementation of the ITFT which poses additional challenges compared to that of the forward transform. A general-radix variant of the ITFT algorithm is developed to allow the implementation to automatically adapt to the memory hierarchy. Then a parallel ITFT algorithm is developed that trades off small arithmetic cost for full vectorization and improved multi-threaded parallelism. The algorithms are automatically generated and tuned to produce an arbitrary-size ITFT library. The new algorithms and the implementation smooths out the staircase performance associated with power-of-two modular FFT implementations, and provide significant performance improvement over zero-padding approaches even when high-performance FFT libraries are used.

References

A. Arnold. A new truncated fourier transform algorithm. In Proceedings of the 38th International Symposium on International Symposium on Symbolic and Algebraic Computation, ISSAC '13, pages 15--22, New York, NY, USA, 2013. ACM. Google ScholarDigital Library
D. H. Bailey. FFTs in external or hierarchical memory. J. Supercomput., 4(1):23--35, Mar. 1990. Google ScholarDigital Library
L. I. Bluestein. A linear filtering approach to the computation of discrete Fourier transform. Audio and Electroacoustics, IEEE Transactions on, 18(4):451--455, 1970.Google Scholar
J. Cooley and J. Tukey. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 19(90):297--301, 1965.Google ScholarCross Ref
F. Franchetti and M. Püschel. Generating high-performance pruned FFT implementations. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 549--552, 2009. Google ScholarDigital Library
F. Franchetti, Y. Voronenko, and M. Püschel. Formal loop merging for signal transforms. In Programming Languages Design and Implementation (PLDI), pages 315--326, 2005. Google ScholarDigital Library
F. Franchetti, Y. Voronenko, and M. Püschel. FFT program generation for shared memory: SMP and multicore. In Supercomputing (SC), 2006. Google ScholarDigital Library
F. Franchetti, Y. Voronenko, and M. Püschel. A rewriting system for the vectorization of signal transforms. In High Performance Computing for Computational Science (VECPAR), volume 4395 of Lecture Notes in Computer Science, pages 363--377. Springer, 2006. Google ScholarDigital Library
M. Frigo and S. Johnson. The design and implementation of FFTW3. Proceedings of the IEEE, 93(2):216--231, 2005.Google ScholarCross Ref
J. V. Z. Gathen and J. Gerhard. Modern Computer Algebra. Cambridge University Press, New York, NY, USA, 2 edition, 2003. Google ScholarDigital Library
I. J. Good. The interaction algorithm and practical Fourier analysis. Journal of the Royal Statistical Society. Series B (Methodological), 20(2): pp. 361--372, 1958.Google ScholarCross Ref
D. Harvey. A cache-friendly truncated FFT. Theor. Comput. Sci., 410(27-29):2649--2658, June 2009. Google ScholarDigital Library
D. Harvey and D. S. Roche. An in-place truncated fourier transform and applications to polynomial multiplication. In Proceedings of the 2010 International Symposium on Symbolic and Algebraic Computation, ISSAC '10, pages 325--329, New York, NY, USA, 2010. ACM. Google ScholarDigital Library
T. Mateer. Fast Fourier transform algorithms with applications. PhD thesis, Citeseer, 2008. Google ScholarDigital Library
X. Li, M. Moreno Maza, R. Rasheed, and E. Schost. The modpn library: Bringing fast polynomial arithmetic into Maple. J. Symb. Comput., 46(7):841--858, July 2011. Google ScholarDigital Library
M. M. Maza and Y. Xie. FFT-based dense polynomial arithmetic on multi-cores. In HPCS, pages 378--399, 2009. Google ScholarDigital Library
L. Meng and J. Johnson. Automatic parallel library generation for general-size modular FFT algorithms. In V. Gerdt, W. Koepf, E. Mayr, and E. Vorozhtsov, editors, Computer Algebra in Scientific Computing, volume 8136 of Lecture Notes in Computer Science, pages 243--256. Springer International Publishing, 2013. Google ScholarDigital Library
L. Meng and J. Johnson. High performance implementation of the TFT. In Proceedings of the 39th International Symposium on Symbolic and Algebraic Computation, ISSAC '14, pages 328--334, New York, NY, USA, 2014. ACM. Google ScholarDigital Library
L. Meng, Y. Voronenko, J. R. Johnson, M. Moreno Maza, F. Franchetti, and Y. Xie. Spiral-generated modular FFT algorithms. In Proceedings of the 4th International Workshop on Parallel and Symbolic Computation, PASCO '10, pages 169--170, New York, NY, USA, 2010. ACM. Google ScholarDigital Library
P. L. Montgomery. Modular multiplication without trial division. Mathematics of Computation, 44(170):519--521, 1985.Google ScholarCross Ref
M. Püschel, J. M. F. Moura, J. Johnson, D. Padua, M. Veloso, B. Singer, J. Xiong, F. Franchetti, A. Gacic, Y. Voronenko, K. Chen, R. W. Johnson, and N. Rizzolo. SPIRAL: Code generation for DSP transforms. Proceedings of the IEEE, special issue on "Program Generation, Optimization, and Adaptation", 93(2):232--275, 2005.Google ScholarCross Ref
C. Rader. Discrete Fourier transforms when the number of data samples is prime. Proceedings of the IEEE, 56(6):1107--1108, 1968.Google ScholarCross Ref
J. van der Hoeven. The truncated Fourier transform and applications. In J. Gutierrez, editor, Proc. ISSAC 2004, pages 290--296, Univ. of Cantabria, Santander, Spain, July 4--7 2004. Google ScholarDigital Library
Y. Voronenko. Library generation for linear transforms. PhD thesis, Electrical and Computer Engineering, Carnegie Mellon University, 2008. Google ScholarDigital Library
J. Xiong, J. Johnson, R. W. Johnson, and D. Padua. SPL: A language and compiler for DSP algorithms. In Programming Languages Design and Implementation (PLDI), pages 298--308, 2001. Google ScholarDigital Library

Index Terms

High performance implementation of the inverse TFT

Recommendations

High performance implementation of the TFT
ISSAC '14: Proceedings of the 39th International Symposium on Symbolic and Algebraic Computation

This paper reports on a high-performance implementation of the truncated Fourier transform (TFT). A general Cooley-Tukey like algorithm for the TFT is developed that allows the implementation to automatically adapt to the memory hierarchy. Then the ...
Read More
Implementation and Performance of the Fast Hartley Transform

The authors investigated the implementation aspects of the fast Hartley transform (FHT) in both software and hardware. They describe the modifications required to convert existing fast Fourier transform (FFT) programs to execute FHTs, showing the ease ...
Read More
New systolic array implementation of the 2-D discrete cosine transform and its inverse

A new systolic array without matrix transposition hardware is proposed to compute the two-dimensional discrete cosine transform (2-D DCT) based on the row-column decomposition. This architecture uses N² multipliers to evaluate N×N-point DCTs at a ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
PASCO '15: Proceedings of the 2015 International Workshop on Parallel Symbolic Computation
July 2015
256 pages
ISBN:9781450335997
DOI:10.1145/2790282
Editors:
Jean-Guillaume Dumas
Université Grenoble Alpes, France
,
Erich L. Kaltofen
North Carolina State University
,
General Chair:
Clément Pernet
Université Grenoble Alpes, France
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 10 July 2015
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 2
  Total Citations
  View Citations
- 73
  Total Downloads
- Downloads (Last 12 months)0
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

High performance implementation of the inverse TFT

PASCO '15: Proceedings of the 2015 International Workshop on Parallel Symbolic Computation

ABSTRACT

References

Cited By

Index Terms

Recommendations

High performance implementation of the TFT

Implementation and Performance of the Fast Hartley Transform

New systolic array implementation of the 2-D discrete cosine transform and its inverse