ABSTRACT
In modern processors, prefetching is an essential component for hiding long-latency memory accesses. However, prefetching too aggressively can easily degrade performance by evicting useful data from cache, or by saturating precious memory bandwidth. Tuning the prefetcher's activity is thus an important problem. Existing techniques tend to focus on detecting negative symptoms of aggressive prefetching, such as unused prefetches being evicted or memory bandwidth saturation, and throttle the prefetcher in response.
We argue that these far-side throttling techniques are inefficient because they require significant tracking state, and because they react to negative effects rather than proactively preventing them. We propose an alternative technique, which we term near-side throttling (NST): it detects late prefetches and tunes the prefetch distance to closely track the point at which most prefetches are no longer late. Because late prefetches are by definition useful, detecting only late prefetches suffices to detect and prevent useless prefetches as well. Our solution is cheap to implement in hardware, includes throttling on off-chip bandwidth saturation, applies to both hardware and software prefetching, and can control multiple concurrent prefetchers, naturally allowing the most useful prefetch algorithm to generate most of the requests. Through detailed simulation of a many-core architecture running a wide range of sequential and parallel applications, we show that NST performs similarly to the state-of-the-art feedback-directed prefetching (FDP), even though it has a significantly lower implementation cost, can react more quickly to changes in application behavior, and is applicable to a more varied set of use cases.
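The core control loop described above can be sketched in a few lines. The following is a minimal illustration of the near-side idea only; the class name, thresholds, epoch length, and doubling/halving policy are all illustrative assumptions, not the paper's actual mechanism or parameters.

```python
class NearSideThrottler:
    """Tune prefetch distance so that most prefetches arrive on time.

    A prefetch is 'late' when the demand access arrives before the
    prefetched line does. Since late prefetches are useful by definition,
    tracking lateness alone lets the controller grow the distance just
    far enough, without the state needed to track useless prefetches.
    All thresholds below are hypothetical placeholders.
    """

    def __init__(self, min_dist=1, max_dist=64,
                 late_hi=0.10, late_lo=0.01, interval=1024):
        self.distance = min_dist          # current prefetch distance
        self.min_dist, self.max_dist = min_dist, max_dist
        self.late_hi, self.late_lo = late_hi, late_lo
        self.interval = interval          # prefetches per decision epoch
        self.issued = 0
        self.late = 0

    def record(self, was_late):
        """Call once per completed prefetch; adjusts at epoch end."""
        self.issued += 1
        if was_late:
            self.late += 1
        if self.issued >= self.interval:
            self._adjust()

    def _adjust(self):
        late_frac = self.late / self.issued
        if late_frac > self.late_hi:
            # Many prefetches arrive too late: look further ahead.
            self.distance = min(self.distance * 2, self.max_dist)
        elif late_frac < self.late_lo:
            # Almost nothing is late: back off to save bandwidth.
            self.distance = max(self.distance // 2, self.min_dist)
        self.issued = self.late = 0
```

Note that the controller needs only two counters and the current distance, which is what makes a near-side scheme cheap compared to far-side techniques that must track individual prefetched lines through the cache.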
- 2016. APEX Application Benchmarks. http://www.lanl.gov/projects/apex/
- 2016. Intel Caffe. https://github.com/intel/caffe
- 2017. SPEC CPU2017 benchmark suite. https://www.spec.org/cpu2017/
- Alaa R. Alameldeen and David A. Wood. 2007. Interactions Between Compression and Prefetching in Chip Multiprocessors. In Proceedings of the International Symposium on High Performance Computer Architecture (HPCA). 228--239.
- Jean-Loup Baer and Tien-Fu Chen. 1991. An Effective On-Chip Preloading Scheme to Reduce Data Access Penalty. In Proceedings of the ACM/IEEE Conference on Supercomputing. 176--186.
- Jean-Loup Baer and Tien-Fu Chen. 1995. Effective Hardware-Based Data Prefetching for High-Performance Processors. IEEE Trans. Comput. 44 (1995), 609--623.
- Trevor E. Carlson, Wim Heirman, and Lieven Eeckhout. 2011. Sniper: Exploring the Level of Abstraction for Scalable and Accurate Parallel Multi-Core Simulations. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC). 52:1--52:12.
- Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 248--255.
- Eiman Ebrahimi, Chang Joo Lee, Onur Mutlu, and Yale N. Patt. 2011. Prefetch-Aware Shared Resource Management for Multi-Core Systems. In Proceedings of the International Symposium on Computer Architecture (ISCA). 141--152.
- Eiman Ebrahimi, Onur Mutlu, Chang Joo Lee, and Yale N. Patt. 2009. Coordinated Control of Multiple Prefetchers in Multi-Core Systems. In Proceedings of the International Symposium on Microarchitecture (MICRO). 316--326.
- Ibrahim Hur and Calvin Lin. 2006. Memory Prefetching Using Adaptive Stream Detection. In Proceedings of the International Symposium on Microarchitecture (MICRO). 397--408.
- Yasuo Ishii, Mary Inaba, and Kei Hiraki. 2011. Access Map Pattern Matching for High Performance Data Cache Prefetch. Journal of Instruction-Level Parallelism 13 (2011), 1--24.
- Akanksha Jain and Calvin Lin. 2013. Linearizing Irregular Memory Accesses for Improved Correlated Prefetching. In Proceedings of the International Symposium on Microarchitecture (MICRO). 247--259.
- Victor Jiménez, Roberto Gioiosa, Francisco J. Cazorla, Alper Buyuktosunoglu, Pradip Bose, and Francis P. O'Connell. 2012. Making Data Prefetch Smarter: Adaptive Prefetching on POWER7. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT). 137--146.
- Norman P. Jouppi. 1990. Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers. In Proceedings of the International Symposium on Computer Architecture (ISCA). 364--373.
- Prathmesh Kallurkar and Smruti R. Sarangi. 2016. pTask: A Smart Prefetching Scheme for OS Intensive Applications. In Proceedings of the International Symposium on Microarchitecture (MICRO). 1--12.
- Jinchun Kim, Seth H. Pugsley, Paul V. Gratz, A. L. Narasimha Reddy, Chris Wilkerson, and Zeshan Chishti. 2016. Path Confidence Based Lookahead Prefetching. In Proceedings of the International Symposium on Microarchitecture (MICRO). 1--12.
- Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the International Conference on Neural Information Processing Systems (NIPS). 1097--1105.
- Chang Joo Lee, Onur Mutlu, Veynu Narasiman, and Yale N. Patt. 2008. Prefetch-Aware DRAM Controllers. In Proceedings of the International Symposium on Microarchitecture (MICRO). 200--209.
- Shih-wei Liao, Tzu-Han Hung, Donald Nguyen, Chinyen Chou, Chiaheng Tu, and Hucheng Zhou. 2009. Machine Learning-Based Prefetch Optimization for Data Center Applications. In Proceedings of the International Conference on High Performance Computing Networking, Storage and Analysis (SC). 56:1--56:10.
- Wei-Fen Lin, Steven K. Reinhardt, and Doug Burger. 2001. Reducing DRAM Latencies with an Integrated Memory Hierarchy Design. In Proceedings of the International Symposium on High Performance Computer Architecture (HPCA). 301--312.
- Pierre Michaud. 2016. Best-Offset Hardware Prefetching. In Proceedings of the International Symposium on High Performance Computer Architecture (HPCA). 469--480.
- Kyle J. Nesbit and James E. Smith. 2005. Data Cache Prefetching Using a Global History Buffer. IEEE Micro 25, 1 (2005), 90--97.
- Biswabandan Panda. 2016. SPAC: A Synergistic Prefetcher Aggressiveness Controller for Multi-Core Systems. IEEE Trans. Comput. 65, 12 (Dec 2016), 3740--3753.
- Seth H. Pugsley, Zeshan Chishti, Chris Wilkerson, Peng-fei Chuang, Robert L. Scott, Aamer Jaleel, Shih-Lien Lu, Kingsum Chow, and Rajeev Balasubramonian. 2014. Sandbox Prefetching: Safe Run-time Evaluation of Aggressive Prefetchers. In Proceedings of the International Symposium on High Performance Computer Architecture (HPCA). 626--637.
- Andres Rodriguez. 2016. Training and Deploying Deep Learning Networks with Caffe* Optimized for Intel® Architecture. Intel Developer Zone.
- Vivek Seshadri, Samihan Yedkar, Hongyi Xin, Onur Mutlu, Phillip B. Gibbons, Michael A. Kozuch, and Todd C. Mowry. 2015. Mitigating Prefetcher-Caused Pollution Using Informed Caching Policies for Prefetched Blocks. ACM Transactions on Architecture and Code Optimization (TACO) 11, 4 (Jan. 2015), 51:1--51:22.
- Avinash Sodani. 2015. Knights Landing (KNL): 2nd Generation Intel® Xeon Phi Processor. In Hot Chips 27 Symposium.
- Santhosh Srinath, Onur Mutlu, Hyesoon Kim, and Yale N. Patt. 2007. Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers. In Proceedings of the International Symposium on High Performance Computer Architecture (HPCA). 63--74.
- Carole-Jean Wu, Aamer Jaleel, Margaret Martonosi, Simon C. Steely, Jr., and Joel Emer. 2011. PACMan: Prefetch-Aware Cache Management for High Performance Caching. In Proceedings of the International Symposium on Microarchitecture (MICRO). 442--453.
- Carole-Jean Wu and Margaret Martonosi. 2011. Characterization and Dynamic Mitigation of Intra-Application Cache Interference. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 2--11.
- Xiangyao Yu, Christopher J. Hughes, Nadathur Satish, and Srinivas Devadas. 2015. IMP: Indirect Memory Prefetcher. In Proceedings of the International Symposium on Microarchitecture (MICRO). 178--190.