ABSTRACT
Collective operations and non-blocking point-to-point operations have always been part of MPI. Although non-blocking collective operations are an obvious extension to MPI, there have been no comprehensive studies of this functionality. In this paper we present LibNBC, a portable high-performance library for implementing non-blocking collective MPI communication operations. LibNBC provides non-blocking versions of all MPI collective operations, is layered on top of MPI-1, and is portable to nearly all parallel architectures. To measure the performance characteristics of our implementation, we also present a microbenchmark for measuring both latency and overlap of computation and communication. Experimental results demonstrate that the blocking performance of the collective operations in our library is comparable to that of collective operations in other high-performance MPI implementations. Our library introduces very low overhead between the application and the underlying MPI and thus, in conjunction with the ability to overlap communication with computation, offers the potential to optimize real-world applications.
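The overlap measurement mentioned above can be illustrated with a small calculation: given the time a communication phase takes alone, the time a computation phase takes alone, and the time the two take when run concurrently, one can derive the fraction of the shorter phase that was hidden behind the longer one. The metric below is an illustrative definition for this sketch, not necessarily the exact quantity reported by the paper's microbenchmark.

```python
def overlap_ratio(t_comm, t_comp, t_combined):
    """Fraction of the shorter phase hidden behind the longer one.

    1.0 means perfect overlap (combined time equals the longer phase alone);
    0.0 means no overlap (combined time equals the serial sum).
    """
    serial = t_comm + t_comp            # no-overlap upper bound
    best = max(t_comm, t_comp)          # fully overlapped lower bound
    if serial == best:                  # one phase has zero duration
        return 1.0
    # Clamp to [0, 1] to absorb timer noise in real measurements.
    ratio = (serial - t_combined) / (serial - best)
    return max(0.0, min(1.0, ratio))

# Example: 10 ms of communication, 8 ms of computation.
print(overlap_ratio(10.0, 8.0, 18.0))  # 0.0 -> fully serialized
print(overlap_ratio(10.0, 8.0, 10.0))  # 1.0 -> fully overlapped
print(overlap_ratio(10.0, 8.0, 14.0))  # 0.5 -> half the computation hidden
```

In a real benchmark, `t_combined` would be measured by initiating a non-blocking collective, running the compute kernel, and then waiting for completion; the closer the result is to `max(t_comm, t_comp)`, the better the MPI layer progresses communication in the background.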