skip to main content
10.1145/2503210.2503242acmconferencesArticle/Chapter ViewAbstractPublication PagesscConference Proceedingsconference-collections
research-article

Tera-scale 1D FFT with low-communication algorithm and Intel® Xeon Phi™ coprocessors

Published:17 November 2013Publication History

ABSTRACT

This paper demonstrates the first tera-scale performance of Intel® Xeon Phi™ coprocessors on 1D FFT computations. Applying a disciplined performance programming methodology of sound algorithm choice, valid performance model, and well-executed optimizations, we break the tera-flop mark on a mere 64 nodes of Xeon Phi and reach 6.7 TFLOPS with 512 nodes, which is 1.5x than achievable on a same number of Intel® Xeon® nodes. It is a challenge to fully utilize the compute capability presented by many-core wide-vector processors for bandwidth-bound FFT computation. We leverage a new algorithm, Segment-of-Interest FFT, with low inter-node communication cost, and aggressively optimize data movements in node-local computations, exploiting caches. Our coordination of low communication algorithm and massively parallel architecture for scalable performance is not limited to running FFT on Xeon Phi; it can serve as a reference for other bandwidth-bound computations and for emerging HPC systems that are increasingly communication limited.

References

  1. Intel® Xeon Phi#8482; Coprocessor Instruction Set Architecture Reference Manual.Google ScholarGoogle Scholar
  2. HPC Challenge Benchmark Results. http://icl.cs.utk.edu/hpcc/hpcc_results.cgi.Google ScholarGoogle Scholar
  3. RIKEN Next-Generation Supercomputer R&D Center. http://www.nsc.riken.jp/index-eng.html.Google ScholarGoogle Scholar
  4. Yuichiro Ajima, Yuzo Takagi, Tomohiro Inoue, Shinya Hiramoto, and Toshiyuki Shimizu. The Tofu Interconnect. In Symposium on High Performance Interconnects (HOTI), 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. David H. Bailey. FFTs in External or Hierarchical Memory. Journal of Supercomputing, 4(1):23--35, 1990. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Grey Ballard, James Demmel, Olga Holtz, and Oded Schwartz. Minimizing Communication in Numerical Linear Algebra. SIAM Journal of Matrix Analysis and Applications, 32:866--901, 2012.Google ScholarGoogle ScholarCross RefCross Ref
  7. Yifeng Chen, Xiang Cui, and Hong Mei. Large-Scale FFT on GPU clusters. In International Conference on Supercomputing (ICS), 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Alex Chunghen Chow, Gordon C. Fossum, and Daniel A. Brokenshire. A Programming Example: Large FFT on the Cell Broadband Engine. In Global Signal Processing Expo, 2005.Google ScholarGoogle Scholar
  9. James W. Cooley and John W. Tukey. An Algorithm for the Machine Computation of Complex Fourier Series. Mathematics of Computation, 19(2):297--301, 1965.Google ScholarGoogle Scholar
  10. Jun Doi and Yasushi Negishi. Overlapping Methods of All-to-All Communication and FFT Algorithms for Torus-Connected Massively Parallel Supercomputers. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Franz Franchetti and Markus Püschel. Encyclopedia of Parallel Computing, chapter Fast Fourier Transform. Springer, 2011.Google ScholarGoogle Scholar
  12. Franz Franchetti, Markus Püschel, Yevgen Voronenko, Srinivas Chellappa, and José M. F. Moura. Discrete Fourier Transform on Multicore. IEEE Signal Processing Magazine, 26(6):90--102, 2009.Google ScholarGoogle ScholarCross RefCross Ref
  13. Matteo Frigo and Steven G. Johnson. The Design and Implementation of FFTW. Proceedings of the IEEE, 93:216--231, 2005.Google ScholarGoogle Scholar
  14. Naga K. Govindaraju, Brandon Lloyd, Yuri Dotsenko, Burton Smith, and John Manferdelli. High Performance Discrete fourier Transforms on Graphics Processors. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Alexander Heinecke, Karthikeyan Vaidyanathan, Mikhail Smelyanskiy, Alexander Kobotov, Roman Dubtsov, Greg Henry, Aniruddha G Shet, George Chrysos, and Pradeep Dubey. Design and Implementation of the Linpack Benchmark for Single and Multi-Node Systems Based on Intel® Xeon Phi#8482; Coprocessor. In IEEE International Parallel and Distributed Processing Systems (IPDPS), 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Balint Joo, Dhiraj D. Kalamkar, Karthikeyan Vaidyanathan, Mikhail Smelyanskiy, Kiran Pamnany, Victor W Lee, Pradeep Dubey, and William Watson III. Lattice QCD on Intel R Xeon Phi coprocessors. In International Supercomputing Conference (ISC), accepted for publication, 2013.Google ScholarGoogle Scholar
  17. Peter Kogge, Keren Bergman, Shekhar Borkar, Dan Campbell, William Carlson, William Dally, Monty Denneau, Paul Franzon, William Harrod, Kerry Hill, Jon Hiller, Sherman Karp, Stephen Keckler, Dean Klein, Robert Lucas, Mark Richards, Al Scarpelli, Steven Scott, Allan Snavely, Thomas Sterling, R. Stanley Williams, and Katherine Yelick. ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems. 2008. www.cse.nd.edu/Reports/2008/TR-2008-13.pdf.Google ScholarGoogle Scholar
  18. Charles Van Loan. Computational Frameworks for the Fast Fourier Transforms. SIAM, 1992. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. John D. McCalpin. STREAM: Sustainable Memory Bandwidth in High Performance Computers. http://www.cs.virginia.edu/stream.Google ScholarGoogle Scholar
  20. Chris McClanahan, Kent Czechowski, Casey Battaglino, Kartik Iyer, P.-K. Yeung, and Richard Vuduc. Prospects for scalable 3D FFTs on heterogeneous exascale systems. 2011.Google ScholarGoogle Scholar
  21. R. Al Na'mneh and D. W. Pan. Two-step 1-D fast Fourier transform without inter-processor communications. In Southeastern Symposium on System Theory, 2006.Google ScholarGoogle ScholarCross RefCross Ref
  22. R. Al Na'mneh, D. W. Pan, and R. Adhami. Parallel implementation of 1-D Fast Fourier Transform without inter-processor communication. In Southeastern Symposium on System Theory, 2005.Google ScholarGoogle ScholarCross RefCross Ref
  23. Akira Nukada, Kento Sato, and Satoshi Matsuoka. Scalable Multi-GPU 3-D FFT for TSUBAME 2.0 Supercomputer. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. John D. Owens, David Luebke, Naga Govindaraju, Mark Harris, Jens Küger, Aaron E. Lefohn, and Timothy J. Purcell. A Survey of General-Purpose Computation on Graphics Hardware. Computer Graphics Forum, 26(1):80--113, 2007.Google ScholarGoogle ScholarCross RefCross Ref
  25. Jongsoo Park, Ping Tak Peter Tang, Mikhail Smelyanskiy, Daehyun Kim, and Thomas Benson. Efficient Backprojection-based Synthetic Aperture Radar Computation with Many-core Processors. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Nadathur Satish, Changkyu Kim, Jatin Chhugani, Anthony D. Nguyen, Victor W. Lee, Daehyun Kim, and Pradeep Dubey. Fast Sort on CPUs and GPUs: A Case for Bandwidth Oblivious SIMD Sort. In International Conference on Management of Data (SIGMOD), 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Fengguang Song, Hatem Ltaief, Bilel Hadri, and Jack Dongarra. Scalable Tile Communication-avoiding QR Factorization on Multicore Cluster Systems. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Daisuke Takahashi. A Parallel 3-D FFT Algorithm on Clusters of Vector SMPs, 2000.Google ScholarGoogle Scholar
  29. Daisuke Takahashi. A parallel 1-D FFT algorithm for the Hitachi SR8000. Parallel Computing, 29:679--690, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Daisuke Takahashi, Taisuke Boku, and Mitsuhisa Sato. A Blocking Algorithm for Parallel 1-D FFT on Clusters of PCs. In International Euro-Par Conference, number 2400, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Daisuke Takahashi, Atsuya Uno, and Mitsuo Yokokawa. An Implementation of Parallel 1-D FFT on the K Computer. In International Conference on High Performance Computing and Communication, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Ping Tak Peter Tang, Jongsoo Park, Daehyun Kim, and Vladimir Petrov. A Framework for Low-Communication 1-D FFT. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Michael Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Tera-scale 1D FFT with low-communication algorithm and Intel® Xeon Phi™ coprocessors

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in
          • Published in

            cover image ACM Conferences
            SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
            November 2013
            1123 pages
            ISBN:9781450323789
            DOI:10.1145/2503210
            • General Chair:
            • William Gropp,
            • Program Chair:
            • Satoshi Matsuoka

            Copyright © 2013 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 17 November 2013

            Permissions

            Request permissions about this article.

            Request Permissions

            Check for updates

            Qualifiers

            • research-article

            Acceptance Rates

            SC '13 Paper Acceptance Rate91of449submissions,20%Overall Acceptance Rate1,516of6,373submissions,24%

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader