research-article

Tera-scale 1D FFT with low-communication algorithm and Intel® Xeon Phi™ coprocessors

Authors:
Jongsoo Park

Parallel Computing Lab

Parallel Computing Lab
View Profile

,
Ganesh Bikshandi

Parallel Computing Lab

Parallel Computing Lab
View Profile

,
Karthikeyan Vaidyanathan

Parallel Computing Lab

Parallel Computing Lab
View Profile

,
Ping Tak Peter Tang

Software and Service Group, Intel Corporation

Software and Service Group, Intel Corporation
View Profile

,
Pradeep Dubey

Parallel Computing Lab

Parallel Computing Lab
View Profile

,
Daehyun Kim

Parallel Computing Lab

Parallel Computing Lab
View Profile

SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and AnalysisNovember 2013Article No.: 34Pages 1–12https://doi.org/10.1145/2503210.2503242

Published:17 November 2013Publication History

SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

Pages 1–12

ABSTRACT

This paper demonstrates the first tera-scale performance of Intel® Xeon Phi™ coprocessors on 1D FFT computations. Applying a disciplined performance programming methodology of sound algorithm choice, valid performance model, and well-executed optimizations, we break the tera-flop mark on a mere 64 nodes of Xeon Phi and reach 6.7 TFLOPS with 512 nodes, which is 1.5x than achievable on a same number of Intel® Xeon® nodes. It is a challenge to fully utilize the compute capability presented by many-core wide-vector processors for bandwidth-bound FFT computation. We leverage a new algorithm, Segment-of-Interest FFT, with low inter-node communication cost, and aggressively optimize data movements in node-local computations, exploiting caches. Our coordination of low communication algorithm and massively parallel architecture for scalable performance is not limited to running FFT on Xeon Phi; it can serve as a reference for other bandwidth-bound computations and for emerging HPC systems that are increasingly communication limited.

References

Intel® Xeon Phi#8482; Coprocessor Instruction Set Architecture Reference Manual.Google Scholar
HPC Challenge Benchmark Results. http://icl.cs.utk.edu/hpcc/hpcc_results.cgi.Google Scholar
RIKEN Next-Generation Supercomputer R&D Center. http://www.nsc.riken.jp/index-eng.html.Google Scholar
Yuichiro Ajima, Yuzo Takagi, Tomohiro Inoue, Shinya Hiramoto, and Toshiyuki Shimizu. The Tofu Interconnect. In Symposium on High Performance Interconnects (HOTI), 2011. Google ScholarDigital Library
David H. Bailey. FFTs in External or Hierarchical Memory. Journal of Supercomputing, 4(1):23--35, 1990. Google ScholarDigital Library
Grey Ballard, James Demmel, Olga Holtz, and Oded Schwartz. Minimizing Communication in Numerical Linear Algebra. SIAM Journal of Matrix Analysis and Applications, 32:866--901, 2012.Google ScholarCross Ref
Yifeng Chen, Xiang Cui, and Hong Mei. Large-Scale FFT on GPU clusters. In International Conference on Supercomputing (ICS), 2010. Google ScholarDigital Library
Alex Chunghen Chow, Gordon C. Fossum, and Daniel A. Brokenshire. A Programming Example: Large FFT on the Cell Broadband Engine. In Global Signal Processing Expo, 2005.Google Scholar
James W. Cooley and John W. Tukey. An Algorithm for the Machine Computation of Complex Fourier Series. Mathematics of Computation, 19(2):297--301, 1965.Google Scholar
Jun Doi and Yasushi Negishi. Overlapping Methods of All-to-All Communication and FFT Algorithms for Torus-Connected Massively Parallel Supercomputers. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2010. Google ScholarDigital Library
Franz Franchetti and Markus Püschel. Encyclopedia of Parallel Computing, chapter Fast Fourier Transform. Springer, 2011.Google Scholar
Franz Franchetti, Markus Püschel, Yevgen Voronenko, Srinivas Chellappa, and José M. F. Moura. Discrete Fourier Transform on Multicore. IEEE Signal Processing Magazine, 26(6):90--102, 2009.Google ScholarCross Ref
Matteo Frigo and Steven G. Johnson. The Design and Implementation of FFTW. Proceedings of the IEEE, 93:216--231, 2005.Google Scholar
Naga K. Govindaraju, Brandon Lloyd, Yuri Dotsenko, Burton Smith, and John Manferdelli. High Performance Discrete fourier Transforms on Graphics Processors. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2008. Google ScholarDigital Library
Alexander Heinecke, Karthikeyan Vaidyanathan, Mikhail Smelyanskiy, Alexander Kobotov, Roman Dubtsov, Greg Henry, Aniruddha G Shet, George Chrysos, and Pradeep Dubey. Design and Implementation of the Linpack Benchmark for Single and Multi-Node Systems Based on Intel® Xeon Phi#8482; Coprocessor. In IEEE International Parallel and Distributed Processing Systems (IPDPS), 2013. Google ScholarDigital Library
Balint Joo, Dhiraj D. Kalamkar, Karthikeyan Vaidyanathan, Mikhail Smelyanskiy, Kiran Pamnany, Victor W Lee, Pradeep Dubey, and William Watson III. Lattice QCD on Intel R Xeon Phi coprocessors. In International Supercomputing Conference (ISC), accepted for publication, 2013.Google Scholar
Peter Kogge, Keren Bergman, Shekhar Borkar, Dan Campbell, William Carlson, William Dally, Monty Denneau, Paul Franzon, William Harrod, Kerry Hill, Jon Hiller, Sherman Karp, Stephen Keckler, Dean Klein, Robert Lucas, Mark Richards, Al Scarpelli, Steven Scott, Allan Snavely, Thomas Sterling, R. Stanley Williams, and Katherine Yelick. ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems. 2008. www.cse.nd.edu/Reports/2008/TR-2008-13.pdf.Google Scholar
Charles Van Loan. Computational Frameworks for the Fast Fourier Transforms. SIAM, 1992. Google ScholarDigital Library
John D. McCalpin. STREAM: Sustainable Memory Bandwidth in High Performance Computers. http://www.cs.virginia.edu/stream.Google Scholar
Chris McClanahan, Kent Czechowski, Casey Battaglino, Kartik Iyer, P.-K. Yeung, and Richard Vuduc. Prospects for scalable 3D FFTs on heterogeneous exascale systems. 2011.Google Scholar
R. Al Na'mneh and D. W. Pan. Two-step 1-D fast Fourier transform without inter-processor communications. In Southeastern Symposium on System Theory, 2006.Google ScholarCross Ref
R. Al Na'mneh, D. W. Pan, and R. Adhami. Parallel implementation of 1-D Fast Fourier Transform without inter-processor communication. In Southeastern Symposium on System Theory, 2005.Google ScholarCross Ref
Akira Nukada, Kento Sato, and Satoshi Matsuoka. Scalable Multi-GPU 3-D FFT for TSUBAME 2.0 Supercomputer. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2012. Google ScholarDigital Library
John D. Owens, David Luebke, Naga Govindaraju, Mark Harris, Jens Küger, Aaron E. Lefohn, and Timothy J. Purcell. A Survey of General-Purpose Computation on Graphics Hardware. Computer Graphics Forum, 26(1):80--113, 2007.Google ScholarCross Ref
Jongsoo Park, Ping Tak Peter Tang, Mikhail Smelyanskiy, Daehyun Kim, and Thomas Benson. Efficient Backprojection-based Synthetic Aperture Radar Computation with Many-core Processors. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2012. Google ScholarDigital Library
Nadathur Satish, Changkyu Kim, Jatin Chhugani, Anthony D. Nguyen, Victor W. Lee, Daehyun Kim, and Pradeep Dubey. Fast Sort on CPUs and GPUs: A Case for Bandwidth Oblivious SIMD Sort. In International Conference on Management of Data (SIGMOD), 2010. Google ScholarDigital Library
Fengguang Song, Hatem Ltaief, Bilel Hadri, and Jack Dongarra. Scalable Tile Communication-avoiding QR Factorization on Multicore Cluster Systems. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2010. Google ScholarDigital Library
Daisuke Takahashi. A Parallel 3-D FFT Algorithm on Clusters of Vector SMPs, 2000.Google Scholar
Daisuke Takahashi. A parallel 1-D FFT algorithm for the Hitachi SR8000. Parallel Computing, 29:679--690, 2003. Google ScholarDigital Library
Daisuke Takahashi, Taisuke Boku, and Mitsuhisa Sato. A Blocking Algorithm for Parallel 1-D FFT on Clusters of PCs. In International Euro-Par Conference, number 2400, 2002. Google ScholarDigital Library
Daisuke Takahashi, Atsuya Uno, and Mitsuo Yokokawa. An Implementation of Parallel 1-D FFT on the K Computer. In International Conference on High Performance Computing and Communication, 2012. Google ScholarDigital Library
Ping Tak Peter Tang, Jongsoo Park, Daehyun Kim, and Vladimir Petrov. A Framework for Low-Communication 1-D FFT. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2012. Google ScholarDigital Library
Michael Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley, 1996. Google ScholarDigital Library

Index Terms

Tera-scale 1D FFT with low-communication algorithm and Intel® Xeon Phi™ coprocessors
1. Computing methodologies
  1. Distributed computing methodologies
    1. Distributed programming languages
  2. Parallel computing methodologies
    1. Parallel programming languages
2. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language types
        Distributed programming languages
        Parallel programming languages

Recommendations

Practical SIMD Vectorization Techniques for Intel® Xeon Phi Coprocessors
IPDPSW '13: Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing Workshops and PhD Forum

Intel® Xeon Phi™ coprocessor is based on the Intel® Many Integrated Core (Intel® MIC) architecture, which is an innovative new processor architecture that combines abundant thread parallelism with long SIMD vector units. Efficiently exploiting SIMD ...
Read More
MrPhi: An Optimized MapReduce Framework on Intel Xeon Phi Coprocessors
In this work, we develop MrPhi, an optimized MapReduce framework on a heterogeneous computing platform, particularly equipped with multiple Intel Xeon Phi coprocessors. To the best of our knowledge, this is the first work to optimize the MapReduce ...
Read More
Effective SIMD vectorization for intel Xeon Phi coprocessors
Special issue on Programming Models, Languages, and Compilers for Manycore and Heterogeneous Architectures

Efficiently exploiting SIMD vector units is one of the most important aspects in achieving high performance of the application code running on Intel Xeon Phi coprocessors. In this paper, we present several effective SIMD vectorization techniques such as ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
November 2013
1123 pages
ISBN:9781450323789
DOI:10.1145/2503210
General Chair:
William Gropp
University of Illinois at Urbana-Champaign, Urbana, Illinois
,
Program Chair:
Satoshi Matsuoka
Tokyo Institute of Technology, Tokyo, Japan
Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 17 November 2013
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
FFT
Xeon Phi
bandwidth optimizations
communication-avoiding algorithms
wide-vector many-core processors
Qualifiers
- research-article
Conference

Acceptance Rates
SC '13 Paper Acceptance Rate91of449submissions,20%Overall Acceptance Rate1,516of6,373submissions,24%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 24
  Total Citations
  View Citations
- 481
  Total Downloads
- Downloads (Last 12 months)6
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Tera-scale 1D FFT with low-communication algorithm and Intel® Xeon Phi™ coprocessors

SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis

ABSTRACT

References

Cited By

Index Terms

Recommendations

Practical SIMD Vectorization Techniques for Intel® Xeon Phi Coprocessors

MrPhi: An Optimized MapReduce Framework on Intel Xeon Phi Coprocessors

Effective SIMD vectorization for intel Xeon Phi coprocessors