Research article
DOI: 10.1145/1362622.1362671

Evaluating NIC hardware requirements to achieve high message rate PGAS support on multi-core processors

Published: 10 November 2007

Abstract

Partitioned global address space (PGAS) programming models have been identified as one of the few viable approaches for dealing with emerging many-core systems. These models tend to generate many small messages, which require specific support from the network interface hardware for efficient execution. In the past, Cray included E-registers on the Cray T3E to support the SHMEM API; however, with the advent of multi-core processors, the balance of computation to communication capabilities has shifted toward computation. This paper explores the message rates that are achievable with multi-core processors and simplified PGAS support on a more conventional network interface. For message rate tests, we find that simple network interface hardware is more than sufficient. We also find that even typical data distributions, such as cyclic or block-cyclic, do not need specialized hardware support. Finally, we assess the impact of such support on the well-known RandomAccess benchmark.




Published In

SC '07: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing
November 2007
723 pages
ISBN: 9781595937643
DOI: 10.1145/1362622
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States



Acceptance Rates

SC '07 Paper Acceptance Rate: 54 of 268 submissions, 20%
Overall Acceptance Rate: 1,516 of 6,373 submissions, 24%


Cited By

• (2014) Enabling communication concurrency through flexible MPI endpoints. The International Journal of High Performance Computing Applications 28(4):390-405. DOI: 10.1177/1094342014548772
• (2014) Reuse Distance Based Circuit Replacement in Silicon Photonic Interconnection Networks for HPC. Proceedings of the 2014 IEEE 22nd Annual Symposium on High-Performance Interconnects, 49-56. DOI: 10.1109/HOTI.2014.20
• (2013) On achieving high message rates. Proceedings of the 13th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, 498-505. DOI: 10.1109/CCGrid.2013.43
• (2012) A preliminary evaluation of the hardware acceleration of the Cray Gemini interconnect for PGAS languages and comparison with MPI. ACM SIGMETRICS Performance Evaluation Review 40(2):92-98. DOI: 10.1145/2381056.2381077
• (2011) Experiences with UPC on TILE-64 processor. Proceedings of the 2011 IEEE Aerospace Conference, 1-9. DOI: 10.1109/AERO.2011.5747452
• (2010) Hybrid PGAS runtime support for multicore nodes. Proceedings of the Fourth Conference on Partitioned Global Address Space Programming Model, 1-10. DOI: 10.1145/2020373.2020376
• (2009) From Silicon to Science. ACM Transactions on Reconfigurable Technology and Systems 2(4):1-15. DOI: 10.1145/1575779.1575786
• (2009) A Resource Optimized Remote-Memory-Access Architecture for Low-latency Communication. Proceedings of the 2009 International Conference on Parallel Processing, 220-227. DOI: 10.1109/ICPP.2009.62
• (2008) Runtime optimization of vector operations on large scale SMP clusters. Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, 122-132. DOI: 10.1145/1454115.1454134
• (2008) HPPNET: A novel network for HPC and its implication for communication software. 2008 IEEE International Symposium on Parallel and Distributed Processing, 1-8. DOI: 10.1109/IPDPS.2008.4536146
