ABSTRACT
BlueGene/L is currently the world's fastest supercomputer. It consists of a large number of low-power dual-processor compute nodes interconnected by high-speed torus and collective networks. Because compute nodes do not have shared memory, MPI is the natural programming model for this machine. The BlueGene/L MPI library is a port of MPICH2. In this paper we discuss the implementation of MPI collectives on BlueGene/L. The MPICH2 implementation of MPI collectives is based on point-to-point communication primitives. This turns out to be suboptimal for a number of reasons. Machine-optimized MPI collectives are necessary to harness the performance of BlueGene/L. We discuss these optimized MPI collectives, describing the algorithms and presenting performance results measured with targeted micro-benchmarks on real BlueGene/L hardware with up to 4096 compute nodes.
CITED BY 8
Maria Eleftheriou, Blake G. Fitch, Aleksandr Rayshubskiy, T.J. Christopher Ward, Phillip Heidelberger, Robert S. Germain, A study of the effects of machine geometry and mapping on distributed transpose performance, Proceedings of the 2008 conference on Computing frontiers, May 05-07, 2008, Ischia, Italy
J. Kozloski, K. Sfyrakis, S. Hill, F. Schürmann, C. Peck, H. Markram, Identifying, tabulating, and analyzing contacts between branched neuron morphologies, IBM Journal of Research and Development, v.52 n.1, p.43-55, January 2008