LFOC: A Lightweight Fairness-Oriented Cache Clustering Policy for Commodity Multicores

ABSTRACT
Multicore processors constitute the architecture of choice for modern computing systems across different market segments. Despite their benefits, the contention that naturally arises when multiple applications compete for shared resources among cores, such as the last-level cache (LLC), may lead to substantial performance degradation. This can negatively affect key system aspects such as throughput and fairness. Assigning the applications in the workload to separate LLC partitions, possibly of different sizes, has proven effective in mitigating the effects of shared-resource contention.
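Fairness in multiprogram workloads is commonly quantified through per-application slowdowns relative to solo execution: a workload is unfair when some applications slow down much more than others under contention. As a minimal illustration (not taken from the article), unfairness can be computed as the ratio between the largest and smallest slowdown:

```python
def unfairness(solo_ipcs, shared_ipcs):
    """Unfairness = max slowdown / min slowdown, where
    slowdown_i = solo_IPC_i / shared_IPC_i (>= 1 under contention).
    A value of 1.0 means all applications degrade equally (perfect fairness)."""
    slowdowns = [solo / shared for solo, shared in zip(solo_ipcs, shared_ipcs)]
    return max(slowdowns) / min(slowdowns)
```

Under this metric, a partitioning policy improves fairness by shrinking the gap between the most and least affected applications, rather than by maximizing aggregate throughput alone.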
In this article we propose LFOC, a clustering-based cache-partitioning scheme that strives to deliver fairness while providing acceptable system throughput. LFOC leverages Intel Cache Allocation Technology (CAT), which enables the system software to divide the LLC into different partitions. To accomplish its goals, LFOC tries to mimic the behavior of the optimal cache-clustering solution, which we approximated for different scenarios by means of a simulator. To this end, LFOC effectively identifies streaming aggressor programs and cache-sensitive applications, which are then assigned to separate cache partitions.
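The separation described above can be sketched as a simple classifier over per-application LLC metrics. The function names and thresholds below are assumptions for illustration only, not LFOC's actual algorithm: the idea is that streaming aggressors (high memory traffic, almost no reuse) are isolated so they cannot evict the working sets of cache-sensitive programs.

```python
def classify(app):
    """app: dict with 'mpki' (LLC misses per kilo-instruction) and
    'hit_ratio' (LLC hit ratio). Thresholds are illustrative assumptions."""
    if app['mpki'] > 10 and app['hit_ratio'] < 0.1:
        return 'streaming'   # high traffic, negligible reuse: cache aggressor
    if app['hit_ratio'] > 0.5:
        return 'sensitive'   # benefits from additional LLC space
    return 'light'           # largely insensitive to LLC size

def cluster(apps):
    """Group applications by class; each class maps to a separate LLC partition."""
    clusters = {'streaming': [], 'sensitive': [], 'light': []}
    for name, metrics in apps.items():
        clusters[classify(metrics)].append(name)
    return clusters
```

On a Linux system with Intel CAT, clusters like these can be mapped to actual LLC partitions through the resctrl filesystem, e.g. by creating a group under /sys/fs/resctrl and writing a cache-way bit mask to its schemata file.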
We implemented LFOC in the Linux kernel and evaluated it on a real system featuring an Intel Skylake processor, comparing its effectiveness with that of two state-of-the-art policies that optimize fairness and throughput, respectively. Our experimental analysis reveals that LFOC achieves a greater reduction in unfairness while relying on a lightweight algorithm suitable for adoption in a real OS.