ABSTRACT
Multithreaded programs are commonly written and optimized for homogeneous multi-core processors under the assumption that all cores deliver equal performance. This assumption greatly simplifies partitioning and balancing an application's workload across threads. However, it no longer holds when core frequencies differ due to within-die variations, and performance degrades as a result. We observe that the performance of a thread depends not only on the frequency of the core it executes on, but also on the share of shared system resources, such as last-level cache, that it receives. We propose variation-aware cache partitioning as an approach to redress the variation-induced imbalance in thread execution times, thereby improving the performance of multithreaded programs. We discuss the challenges involved in realizing our proposal: synchronization (e.g., barriers) across threads, which causes faster threads to be limited by slower ones; the complex, non-linear relationship between a thread's performance and the cache capacity allocated to it; and the fact that different program phases can respond quite differently to varying cache capacity. We propose a runtime scheme that performs spatio-temporal cache partitioning while considering both chip characteristics (frequency variations) and program characteristics. We evaluate the proposed technique on an ensemble of variation-impacted multi-cores executing multithreaded programs from the PARSEC and SPEC-OMP suites, and demonstrate an average performance improvement of 15% from mitigating the impact of frequency variations.
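The core idea, giving more last-level cache to threads on slower cores so that all threads reach a barrier at about the same time, can be illustrated with a toy model. The sketch below is not the runtime scheme proposed in the paper. It assumes a simple additive time model (compute time scales inversely with core frequency, memory time scales with misses), a known per-thread miss curve, and a greedy heuristic that repeatedly hands one cache way to the currently slowest thread. All names (`thread_time_ns`, `variation_aware_partition`, `even_partition_time`) and all numbers are hypothetical.

```python
def thread_time_ns(cycles, freq_ghz, misses, miss_penalty_ns):
    """Toy model: compute time plus memory stall time, in nanoseconds."""
    return cycles / freq_ghz + misses * miss_penalty_ns


def variation_aware_partition(freqs_ghz, miss_curves, total_ways,
                              cycles=1e6, miss_penalty_ns=100.0):
    """Greedily assign cache ways to minimize the barrier (max) time.

    miss_curves[i][w - 1] is the assumed miss count of thread i when it
    owns w ways. Each thread starts with one way; every remaining way
    goes to whichever thread is currently the slowest.
    """
    n = len(freqs_ghz)
    alloc = [1] * n

    def time_of(i):
        return thread_time_ns(cycles, freqs_ghz[i],
                              miss_curves[i][alloc[i] - 1], miss_penalty_ns)

    for _ in range(total_ways - n):
        slowest = max(range(n), key=time_of)
        alloc[slowest] += 1

    # Threads synchronize at a barrier, so the slowest one sets the pace.
    return alloc, max(time_of(i) for i in range(n))


def even_partition_time(freqs_ghz, miss_curves, total_ways,
                        cycles=1e6, miss_penalty_ns=100.0):
    """Barrier time when every thread receives the same number of ways."""
    ways = total_ways // len(freqs_ghz)
    return max(
        thread_time_ns(cycles, f, curve[ways - 1], miss_penalty_ns)
        for f, curve in zip(freqs_ghz, miss_curves)
    )


if __name__ == "__main__":
    # Two identical threads on cores that differ only in frequency.
    freqs = [2.0, 1.6]                       # GHz, made-up values
    curve = [8000, 6000, 4500, 3500, 2800, 2300, 2000, 1800]
    curves = [curve, curve]                  # misses for 1..8 ways

    alloc, t_aware = variation_aware_partition(freqs, curves, total_ways=8)
    t_even = even_partition_time(freqs, curves, total_ways=8)
    print("allocation:", alloc)
    print("barrier time, variation-aware (ns):", t_aware)
    print("barrier time, even split (ns):", t_even)
```

In this made-up example the thread on the slower core ends up with more ways than the thread on the faster core, and the barrier time is lower than with an even split. A greedy pass like this ignores the challenges the abstract names: real miss curves can be non-convex, and they change across program phases, which is why a runtime scheme would need to re-evaluate the partition over time.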