ABSTRACT
This paper demonstrates the first tera-scale performance of Intel® Xeon Phi™ coprocessors on 1D FFT computations. Applying a disciplined performance programming methodology of sound algorithm choice, valid performance model, and well-executed optimizations, we break the tera-flop mark on a mere 64 nodes of Xeon Phi and reach 6.7 TFLOPS with 512 nodes, which is 1.5x than achievable on a same number of Intel® Xeon® nodes. It is a challenge to fully utilize the compute capability presented by many-core wide-vector processors for bandwidth-bound FFT computation. We leverage a new algorithm, Segment-of-Interest FFT, with low inter-node communication cost, and aggressively optimize data movements in node-local computations, exploiting caches. Our coordination of low communication algorithm and massively parallel architecture for scalable performance is not limited to running FFT on Xeon Phi; it can serve as a reference for other bandwidth-bound computations and for emerging HPC systems that are increasingly communication limited.
- Intel® Xeon Phi#8482; Coprocessor Instruction Set Architecture Reference Manual.Google Scholar
- HPC Challenge Benchmark Results. http://icl.cs.utk.edu/hpcc/hpcc_results.cgi.Google Scholar
- RIKEN Next-Generation Supercomputer R&D Center. http://www.nsc.riken.jp/index-eng.html.Google Scholar
- Yuichiro Ajima, Yuzo Takagi, Tomohiro Inoue, Shinya Hiramoto, and Toshiyuki Shimizu. The Tofu Interconnect. In Symposium on High Performance Interconnects (HOTI), 2011. Google ScholarDigital Library
- David H. Bailey. FFTs in External or Hierarchical Memory. Journal of Supercomputing, 4(1):23--35, 1990. Google ScholarDigital Library
- Grey Ballard, James Demmel, Olga Holtz, and Oded Schwartz. Minimizing Communication in Numerical Linear Algebra. SIAM Journal of Matrix Analysis and Applications, 32:866--901, 2012.Google ScholarCross Ref
- Yifeng Chen, Xiang Cui, and Hong Mei. Large-Scale FFT on GPU clusters. In International Conference on Supercomputing (ICS), 2010. Google ScholarDigital Library
- Alex Chunghen Chow, Gordon C. Fossum, and Daniel A. Brokenshire. A Programming Example: Large FFT on the Cell Broadband Engine. In Global Signal Processing Expo, 2005.Google Scholar
- James W. Cooley and John W. Tukey. An Algorithm for the Machine Computation of Complex Fourier Series. Mathematics of Computation, 19(2):297--301, 1965.Google Scholar
- Jun Doi and Yasushi Negishi. Overlapping Methods of All-to-All Communication and FFT Algorithms for Torus-Connected Massively Parallel Supercomputers. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2010. Google ScholarDigital Library
- Franz Franchetti and Markus Püschel. Encyclopedia of Parallel Computing, chapter Fast Fourier Transform. Springer, 2011.Google Scholar
- Franz Franchetti, Markus Püschel, Yevgen Voronenko, Srinivas Chellappa, and José M. F. Moura. Discrete Fourier Transform on Multicore. IEEE Signal Processing Magazine, 26(6):90--102, 2009.Google ScholarCross Ref
- Matteo Frigo and Steven G. Johnson. The Design and Implementation of FFTW. Proceedings of the IEEE, 93:216--231, 2005.Google Scholar
- Naga K. Govindaraju, Brandon Lloyd, Yuri Dotsenko, Burton Smith, and John Manferdelli. High Performance Discrete fourier Transforms on Graphics Processors. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2008. Google ScholarDigital Library
- Alexander Heinecke, Karthikeyan Vaidyanathan, Mikhail Smelyanskiy, Alexander Kobotov, Roman Dubtsov, Greg Henry, Aniruddha G Shet, George Chrysos, and Pradeep Dubey. Design and Implementation of the Linpack Benchmark for Single and Multi-Node Systems Based on Intel® Xeon Phi#8482; Coprocessor. In IEEE International Parallel and Distributed Processing Systems (IPDPS), 2013. Google ScholarDigital Library
- Balint Joo, Dhiraj D. Kalamkar, Karthikeyan Vaidyanathan, Mikhail Smelyanskiy, Kiran Pamnany, Victor W Lee, Pradeep Dubey, and William Watson III. Lattice QCD on Intel R Xeon Phi coprocessors. In International Supercomputing Conference (ISC), accepted for publication, 2013.Google Scholar
- Peter Kogge, Keren Bergman, Shekhar Borkar, Dan Campbell, William Carlson, William Dally, Monty Denneau, Paul Franzon, William Harrod, Kerry Hill, Jon Hiller, Sherman Karp, Stephen Keckler, Dean Klein, Robert Lucas, Mark Richards, Al Scarpelli, Steven Scott, Allan Snavely, Thomas Sterling, R. Stanley Williams, and Katherine Yelick. ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems. 2008. www.cse.nd.edu/Reports/2008/TR-2008-13.pdf.Google Scholar
- Charles Van Loan. Computational Frameworks for the Fast Fourier Transforms. SIAM, 1992. Google ScholarDigital Library
- John D. McCalpin. STREAM: Sustainable Memory Bandwidth in High Performance Computers. http://www.cs.virginia.edu/stream.Google Scholar
- Chris McClanahan, Kent Czechowski, Casey Battaglino, Kartik Iyer, P.-K. Yeung, and Richard Vuduc. Prospects for scalable 3D FFTs on heterogeneous exascale systems. 2011.Google Scholar
- R. Al Na'mneh and D. W. Pan. Two-step 1-D fast Fourier transform without inter-processor communications. In Southeastern Symposium on System Theory, 2006.Google ScholarCross Ref
- R. Al Na'mneh, D. W. Pan, and R. Adhami. Parallel implementation of 1-D Fast Fourier Transform without inter-processor communication. In Southeastern Symposium on System Theory, 2005.Google ScholarCross Ref
- Akira Nukada, Kento Sato, and Satoshi Matsuoka. Scalable Multi-GPU 3-D FFT for TSUBAME 2.0 Supercomputer. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2012. Google ScholarDigital Library
- John D. Owens, David Luebke, Naga Govindaraju, Mark Harris, Jens Küger, Aaron E. Lefohn, and Timothy J. Purcell. A Survey of General-Purpose Computation on Graphics Hardware. Computer Graphics Forum, 26(1):80--113, 2007.Google ScholarCross Ref
- Jongsoo Park, Ping Tak Peter Tang, Mikhail Smelyanskiy, Daehyun Kim, and Thomas Benson. Efficient Backprojection-based Synthetic Aperture Radar Computation with Many-core Processors. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2012. Google ScholarDigital Library
- Nadathur Satish, Changkyu Kim, Jatin Chhugani, Anthony D. Nguyen, Victor W. Lee, Daehyun Kim, and Pradeep Dubey. Fast Sort on CPUs and GPUs: A Case for Bandwidth Oblivious SIMD Sort. In International Conference on Management of Data (SIGMOD), 2010. Google ScholarDigital Library
- Fengguang Song, Hatem Ltaief, Bilel Hadri, and Jack Dongarra. Scalable Tile Communication-avoiding QR Factorization on Multicore Cluster Systems. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2010. Google ScholarDigital Library
- Daisuke Takahashi. A Parallel 3-D FFT Algorithm on Clusters of Vector SMPs, 2000.Google Scholar
- Daisuke Takahashi. A parallel 1-D FFT algorithm for the Hitachi SR8000. Parallel Computing, 29:679--690, 2003. Google ScholarDigital Library
- Daisuke Takahashi, Taisuke Boku, and Mitsuhisa Sato. A Blocking Algorithm for Parallel 1-D FFT on Clusters of PCs. In International Euro-Par Conference, number 2400, 2002. Google ScholarDigital Library
- Daisuke Takahashi, Atsuya Uno, and Mitsuo Yokokawa. An Implementation of Parallel 1-D FFT on the K Computer. In International Conference on High Performance Computing and Communication, 2012. Google ScholarDigital Library
- Ping Tak Peter Tang, Jongsoo Park, Daehyun Kim, and Vladimir Petrov. A Framework for Low-Communication 1-D FFT. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2012. Google ScholarDigital Library
- Michael Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley, 1996. Google ScholarDigital Library
Index Terms
- Tera-scale 1D FFT with low-communication algorithm and Intel® Xeon Phi™ coprocessors
Recommendations
Practical SIMD Vectorization Techniques for Intel® Xeon Phi Coprocessors
IPDPSW '13: Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing Workshops and PhD ForumIntel® Xeon Phi™ coprocessor is based on the Intel® Many Integrated Core (Intel® MIC) architecture, which is an innovative new processor architecture that combines abundant thread parallelism with long SIMD vector units. Efficiently exploiting SIMD ...
MrPhi: An Optimized MapReduce Framework on Intel Xeon Phi Coprocessors
In this work, we develop MrPhi, an optimized MapReduce framework on a heterogeneous computing platform, particularly equipped with multiple Intel Xeon Phi coprocessors. To the best of our knowledge, this is the first work to optimize the MapReduce ...
Effective SIMD vectorization for intel Xeon Phi coprocessors
Special issue on Programming Models, Languages, and Compilers for Manycore and Heterogeneous ArchitecturesEfficiently exploiting SIMD vector units is one of the most important aspects in achieving high performance of the application code running on Intel Xeon Phi coprocessors. In this paper, we present several effective SIMD vectorization techniques such as ...
Comments