Runtime-Guided Mitigation of Manufacturing Variability in Power-Constrained Multi-Socket NUMA Nodes

Authors:

Dimitrios Chasapis,

Miquel Moretó,

Eduard Ayguadé,

Mateo Valero

ICS '16: Proceedings of the 2016 International Conference on Supercomputing

Article No.: 5, Pages 1 - 12

https://doi.org/10.1145/2925426.2926279

Published: 01 June 2016

Abstract

Current large-scale systems show increasing power demands, to the point that power has become a major strain on facilities and budgets. Researchers in academia, labs and industry are focusing on dealing with this "power wall", striving to find a balance between performance and power consumption. Some commodity processors enable power capping, which opens up new opportunities for applications to directly manage their power behavior at user level. However, while power capping ensures that a system never exceeds a given power limit, it also introduces a new form of heterogeneity. Natural manufacturing variability was previously hidden, because processors drew varying amounts of power to deliver homogeneous performance. Under a power cap, the same variability results in heterogeneous performance, because CPU frequencies, potentially those of each core, differ in order to enforce the limit.

In this work we show how a parallel runtime system can be used to effectively deal with this new kind of performance heterogeneity by compensating for the uneven effects of power capping. In the context of a NUMA node composed of several multi-core sockets, our system optimizes the energy and concurrency levels assigned to each socket to maximize performance. Because it is applied transparently within the parallel runtime system, it requires no programmer interaction, such as changing the application source code or manually reconfiguring the parallel system. We compare our novel runtime analysis with an offline approach and demonstrate that it can achieve equal performance at a fraction of the cost.
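The idea of rebalancing a power budget between unevenly behaving sockets can be illustrated with a toy model. The sketch below is not the authors' runtime or algorithm. It assumes a hypothetical power model (static power plus a dynamic term that grows with the cube of frequency), two hypothetical sockets with made-up coefficients, and an evenly split workload, so that the slowest socket determines performance. All names (`Socket`, `freq_under_cap`, `best_split`) are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Socket:
    """Toy socket model; all coefficients are hypothetical."""
    name: str
    static_w: float   # assumed static (leakage) power in watts
    dyn_coeff: float  # assumed dynamic coefficient: P_dyn = dyn_coeff * f**3

    def freq_under_cap(self, cap_w: float) -> float:
        """Highest modeled frequency (GHz) whose power fits under cap_w."""
        headroom = max(cap_w - self.static_w, 0.0)
        return (headroom / self.dyn_coeff) ** (1.0 / 3.0)


def best_split(a: Socket, b: Socket, node_budget_w: float, step_w: float = 0.5):
    """Exhaustively try power splits and keep the one that maximizes
    the frequency of the slowest socket (assumes evenly split work)."""
    best = None
    steps = int(node_budget_w / step_w)
    for i in range(steps + 1):
        cap_a = i * step_w
        cap_b = node_budget_w - cap_a
        slowest = min(a.freq_under_cap(cap_a), b.freq_under_cap(cap_b))
        if best is None or slowest > best[0]:
            best = (slowest, cap_a, cap_b)
    return best


if __name__ == "__main__":
    efficient = Socket("socket0", static_w=20.0, dyn_coeff=2.0)
    leaky = Socket("socket1", static_w=30.0, dyn_coeff=2.5)
    budget = 130.0

    # Same cap on both sockets: variability shows up as different frequencies.
    even = [s.freq_under_cap(budget / 2) for s in (efficient, leaky)]
    print("even split frequencies (GHz):", even)

    # Shifting power toward the less efficient socket narrows the gap.
    slowest, cap_a, cap_b = best_split(efficient, leaky, budget)
    print("balanced caps (W):", cap_a, cap_b, "slowest socket (GHz):", slowest)
```

In this toy setting the less efficient socket receives the larger share of the budget. A real runtime would have to measure behavior online rather than rely on a fixed model, and, as the abstract notes, it can also adjust the concurrency level of each socket, which this sketch leaves out.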




Published In

ICS '16: Proceedings of the 2016 International Conference on Supercomputing

June 2016

547 pages

ISBN:9781450343619

DOI:10.1145/2925426

Copyright © 2016 Public Domain.

This paper is authored by an employee(s) of the United States Government and is in the public domain. Non-exclusive copying or redistribution is allowed, provided that the article citation is given and the authors and agency are clearly identified as its source.

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

ICS '16

Sponsor:

SIGARCH

ICS '16: 2016 International Conference on Supercomputing

June 1 - 3, 2016

Istanbul, Turkey

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Cited By

Jain R, Tran B, Chen K, Sinclair M, Venkataraman S (2024). PAL: A Variability-Aware Policy for Scheduling ML Workloads in GPU Clusters. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pages 1-18. DOI: 10.1109/SC41406.2024.00032. Online publication date: 17-Nov-2024.
https://dl.acm.org/doi/10.1109/SC41406.2024.00032

Solórzano A, Sato K, Yamamoto K, Shoji F, Brandt J, Schwaller B, Walton S, Green J, Tiwari D (2024). Toward Sustainable HPC: In-Production Deployment of Incentive-Based Power Efficiency Mechanism on the Fugaku Supercomputer. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pages 1-16. DOI: 10.1109/SC41406.2024.00030. Online publication date: 17-Nov-2024.
https://dl.acm.org/doi/10.1109/SC41406.2024.00030

Caheny P, Alvarez L, Casas M, Moreto M (2022). TD-NUCA: Runtime Driven Management of NUCA Caches in Task Dataflow Programming Models. SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-15. DOI: 10.1109/SC41404.2022.00085. Online publication date: Nov-2022.
https://doi.org/10.1109/SC41404.2022.00085

Sinha P, Guliani A, Jain R, Tran B, Sinclair M, Venkataraman S (2022). Not All GPUs Are Created Equal: Characterizing Variability in Large-Scale, Accelerator-Rich Systems. SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 01-15. DOI: 10.1109/SC41404.2022.00070. Online publication date: Nov-2022.
https://doi.org/10.1109/SC41404.2022.00070

HE Y, WADA Y, LUO W, SAKAMOTO R, PAN G, CAO T, KONDO M (2021). Efficient and Precise Profiling, Modeling and Management on Power and Performance for Power Constrained HPC Systems. IEICE Transactions on Electronics, E104.C(6), pages 237-246. DOI: 10.1587/transele.2020LHP0005. Online publication date: 1-Jun-2021.
https://doi.org/10.1587/transele.2020LHP0005

Ma Y, He Y, Wada Y, Luo W, Sakamoto R, Kondo M (2021). Mitigating Process Variations with Cooperative Tuning for Performance and Power through a Simple DSL. 2021 Ninth International Symposium on Computing and Networking Workshops (CANDARW), pages 94-100. DOI: 10.1109/CANDARW53999.2021.00023. Online publication date: Nov-2021.
https://doi.org/10.1109/CANDARW53999.2021.00023

Imes C, Hofmeyr S, Kang D, Walters J (2020). A Case Study and Characterization of a Many-socket, Multi-tier NUMA HPC Platform. 2020 IEEE/ACM 6th Workshop on the LLVM Compiler Infrastructure in HPC (LLVM-HPC) and Workshop on Hierarchical Parallelism for Exascale Computing (HiPar), pages 74-84. DOI: 10.1109/LLVMHPCHiPar51896.2020.00013. Online publication date: Nov-2020.
https://doi.org/10.1109/LLVMHPCHiPar51896.2020.00013

Imes C, Colin A, Zhang N, Srivastava A, Prasanna V, Walters J (2020). Compiler Abstractions and Runtime for Extreme-scale SAR and CFD Workloads. 2020 IEEE/ACM Fifth International Workshop on Extreme Scale Programming Models and Middleware (ESPM2), pages 1-7. DOI: 10.1109/ESPM251964.2020.00010. Online publication date: Nov-2020.
https://doi.org/10.1109/ESPM251964.2020.00010

Chasapis D, Moretó M, Schulz M, Rountree B, Valero M, Casas M, Eigenmann R, Ding C, McKee S (2019). Power efficient job scheduling by predicting the impact of processor manufacturing variability. Proceedings of the ACM International Conference on Supercomputing, pages 296-307. DOI: 10.1145/3330345.3330372. Online publication date: 26-Jun-2019.
https://dl.acm.org/doi/10.1145/3330345.3330372

Zou P, Feng X, Ge R (2019). Contention Aware Workload and Resource Co-Scheduling on Power-Bounded Systems. 2019 IEEE International Conference on Networking, Architecture and Storage (NAS), pages 1-8. DOI: 10.1109/NAS.2019.8834721. Online publication date: Aug-2019.
https://doi.org/10.1109/NAS.2019.8834721
