Optimization of MPI collective communication on BlueGene/L systems
Source: International Conference on Supercomputing
Proceedings of the 19th Annual International Conference on Supercomputing
Cambridge, Massachusetts
SESSION: Session 7: machines
Pages: 253 - 262  
Year of Publication: 2005
ISBN:1-59593-167-8
Authors
George Almási  IBM T.J. Watson Research Center, Yorktown Heights, NY
Philip Heidelberger  IBM T.J. Watson Research Center, Yorktown Heights, NY
Charles J. Archer  IBM Systems and Technology Group, Rochester, MN
Xavier Martorell  Universitat Politècnica de Catalunya, Barcelona, Spain
C. Chris Erway  Brown University, Providence, RI
José E. Moreira  IBM Systems and Technology Group, Rochester, MN
B. Steinmacher-Burow  IBM Germany, Boeblingen, Germany
Yili Zheng  Purdue University, West Lafayette, IN
Sponsor
SIGARCH: ACM Special Interest Group on Computer Architecture
Publisher
ACM  New York, NY, USA

ABSTRACT

BlueGene/L is currently the world's fastest supercomputer. It consists of a large number of low-power dual-processor compute nodes interconnected by high-speed torus and collective networks. Because compute nodes do not have shared memory, MPI is the natural programming model for this machine. The BlueGene/L MPI library is a port of MPICH2. In this paper we discuss the implementation of MPI collectives on BlueGene/L. The MPICH2 implementation of MPI collectives is based on point-to-point communication primitives. This turns out to be suboptimal for a number of reasons. Machine-optimized MPI collectives are necessary to harness the performance of BlueGene/L. We discuss these optimized MPI collectives, describing the algorithms and presenting performance results measured with targeted micro-benchmarks on real BlueGene/L hardware with up to 4096 compute nodes.
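To make "collectives based on point-to-point communication primitives" concrete, the following is a minimal sketch of a textbook binomial-tree broadcast schedule, a common way to build a broadcast out of individual sends. It is an illustration only: the abstract does not say which algorithms are used, and the function name `binomial_broadcast_schedule` and its parameters are hypothetical. The code only simulates who sends to whom in each round; it does not call any MPI library.

```python
def binomial_broadcast_schedule(num_ranks, root=0):
    """Simulate a binomial-tree broadcast built from point-to-point sends.

    Returns a list of rounds; each round is a list of (sender, receiver)
    pairs that could proceed in parallel. Hypothetical helper, not taken
    from MPICH2 or from the paper.
    """
    rounds = []
    mask = 1
    # In each round, every rank that already holds the data forwards it
    # to one rank that does not, so the number of holders doubles.
    while mask < num_ranks:
        pairs = []
        for rel in range(mask):  # relative ranks 0..mask-1 hold the data
            peer = rel + mask
            if peer < num_ranks:
                sender = (rel + root) % num_ranks
                receiver = (peer + root) % num_ranks
                pairs.append((sender, receiver))
        rounds.append(pairs)
        mask <<= 1
    return rounds


if __name__ == "__main__":
    for step, pairs in enumerate(binomial_broadcast_schedule(8)):
        print(f"round {step}: {pairs}")
```

Under this scheme, a broadcast over P ranks takes ceil(log2 P) rounds and P - 1 point-to-point messages in total, regardless of how the ranks are laid out on the physical network.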

