research-article

Mitigating network noise on Dragonfly networks through application-aware routing

Published: 17 November 2019

Abstract

System noise can degrade the performance of HPC systems, and the interconnection network is one of the main contributors to this problem. To mitigate this effect, adaptive routing sends packets on non-minimal paths when those paths are less congested. However, while this may reduce interference caused by congestion, it also generates more traffic because packets traverse additional hops, which in turn causes congestion for other applications and for the application itself. In this paper, we first describe how to estimate network noise. Following these guidelines, we show how noise can be reduced by using routing algorithms that select minimal paths with a higher probability. We exploit this knowledge to design an algorithm that changes the probability of selecting minimal paths according to the application characteristics. We validate our solution on microbenchmarks and real-world applications on two systems that rely on a Dragonfly interconnection network, showing noise reduction and performance improvement.
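The idea of biasing adaptive routing toward minimal paths can be illustrated with a small toy model. The sketch below is not the algorithm from the paper: it assumes a UGAL-style comparison of queue occupancy weighted by hop count, with an additive `bias` term in favor of the minimal path. The hop counts, the queue-depth range, the message-size threshold, and the names `choose_minimal`, `minimal_fraction`, and `pick_bias` are all hypothetical and chosen only for illustration.

```python
import random


def choose_minimal(q_min, q_nonmin, hops_min, hops_nonmin, bias):
    """Return True when the minimal path is selected.

    Assumed UGAL-style rule: compare estimated delay (queue depth times
    hop count) on both paths, and add `bias` to the non-minimal side so
    that a larger bias favors the minimal path.
    """
    return q_min * hops_min <= q_nonmin * hops_nonmin + bias


def minimal_fraction(bias, trials=10_000, seed=0):
    """Estimate how often the minimal path wins under random queue depths.

    Queue depths are drawn uniformly from 0..20 and the hop counts (3 and 5)
    are illustrative values, not measurements from a real system.
    """
    rng = random.Random(seed)
    chosen = 0
    for _ in range(trials):
        q_min = rng.randint(0, 20)
        q_nonmin = rng.randint(0, 20)
        if choose_minimal(q_min, q_nonmin, hops_min=3, hops_nonmin=5, bias=bias):
            chosen += 1
    return chosen / trials


def pick_bias(bytes_per_message, small_threshold=4096, low_bias=0, high_bias=40):
    """Hypothetical application-aware policy.

    Assumes that small, latency-sensitive messages should prefer minimal
    paths (high bias) and that large messages may spread over non-minimal
    paths (low bias). The threshold and bias values are made up.
    """
    return high_bias if bytes_per_message < small_threshold else low_bias


if __name__ == "__main__":
    for bias in (0, 20, 40, 100):
        print(f"bias={bias:3d} minimal fraction={minimal_fraction(bias):.3f}")
    print("bias for 1 KiB messages:", pick_bias(1024))
    print("bias for 1 MiB messages:", pick_bias(1 << 20))
```

Because the same seed produces the same queue samples for every bias value, the fraction of minimal selections can only grow as the bias grows, which mirrors the abstract's notion of selecting minimal paths "with a higher probability". How the bias should actually be derived from application characteristics is the subject of the paper and is not reproduced here.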

Published In

SC '19: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
November 2019
1921 pages
ISBN:9781450362290
DOI:10.1145/3295500

Sponsors

In-Cooperation

  • IEEE CS

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. dragonfly
  2. network noise
  3. routing

Qualifiers

  • Research-article

Funding Sources

  • European Research Council (ERC)

Conference

SC '19

Acceptance Rates

Overall Acceptance Rate 1,516 of 6,373 submissions, 24%
