Implementation and performance analysis of non-blocking collective operations for MPI

Published: 10 November 2007

ABSTRACT

Collective operations and non-blocking point-to-point operations have always been part of MPI. Although non-blocking collective operations are an obvious extension to MPI, there have been no comprehensive studies of this functionality. In this paper we present LibNBC, a portable high-performance library for implementing non-blocking collective MPI communication operations. LibNBC provides non-blocking versions of all MPI collective operations, is layered on top of MPI-1, and is portable to nearly all parallel architectures. To measure the performance characteristics of our implementation, we also present a microbenchmark for measuring both latency and overlap of computation and communication. Experimental results demonstrate that the blocking performance of the collective operations in our library is comparable to that of collective operations in other high-performance MPI implementations. Our library introduces very low overhead between the application and the underlying MPI and thus, combined with the ability to overlap communication with computation, offers the potential for optimizing real-world applications.
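The initiate, compute, wait pattern behind communication/computation overlap can be illustrated with a small sketch. This is not LibNBC's API and not the paper's microbenchmark: a Python thread stands in for a non-blocking collective, `time.sleep` stands in for both phases, and the overlap metric (time saved relative to running both phases back to back, divided by the communication time) is an assumed definition chosen for illustration. All function names are hypothetical.

```python
import threading
import time


def overlap_fraction(t_comm, t_comp, t_total):
    """Share of the communication time hidden behind computation.

    Assumed definition (not taken from the paper): the time saved compared
    with running both phases back to back, divided by the communication time,
    clamped to the range [0, 1].
    """
    if t_comm <= 0:
        return 0.0
    saved = (t_comm + t_comp) - t_total
    return max(0.0, min(1.0, saved / t_comm))


def timed(fn):
    """Return the wall-clock time of one call to fn."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def run_overlapped(communicate, compute):
    """Initiate, compute, wait: the usage pattern of a non-blocking operation."""
    worker = threading.Thread(target=communicate)  # stand-in for the collective
    start = time.perf_counter()
    worker.start()   # initiate the operation
    compute()        # independent work proceeds in the meantime
    worker.join()    # wait for completion
    return time.perf_counter() - start


def demo(comm_seconds=0.05, comp_seconds=0.05):
    """Measure both phases alone, then overlapped, and report the fraction."""
    def communicate():
        time.sleep(comm_seconds)  # hypothetical communication phase

    def compute():
        time.sleep(comp_seconds)  # hypothetical computation phase

    t_comm = timed(communicate)
    t_comp = timed(compute)
    t_total = run_overlapped(communicate, compute)
    return overlap_fraction(t_comm, t_comp, t_total)
```

Under this assumed metric, a value near 1 means the communication was almost entirely hidden behind computation, and a value near 0 means the two phases effectively ran one after the other.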

  • Published in

    SC '07: Proceedings of the 2007 ACM/IEEE conference on Supercomputing
    November 2007
    723 pages
    ISBN:9781595937643
    DOI:10.1145/1362622

    Copyright © 2007 ACM

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Qualifiers

    • research-article

    Acceptance Rates

    SC '07 Paper Acceptance Rate: 54 of 268 submissions, 20%

    Overall Acceptance Rate: 1,516 of 6,373 submissions, 24%
