Filtering Translation Bandwidth with Virtual Caching

Authors:
Hongil Yoon, University of Wisconsin-Madison, Madison, WI, USA
Jason Lowe-Power, University of California, Davis, Davis, CA, USA
Gurindar S. Sohi, University of Wisconsin-Madison, Madison, WI, USA

ASPLOS '18: Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, March 2018, Pages 113–127
https://doi.org/10.1145/3173162.3173195

Published: 19 March 2018

ABSTRACT

Heterogeneous computing with GPUs integrated on the same chip as CPUs is ubiquitous, and to increase programmability many of these systems support virtual address accesses from GPU hardware. However, this entails address translation on every memory access. We observe that future GPUs and workloads show very high bandwidth demands (up to 4 accesses per cycle in some cases) for shared address translation hardware due to frequent private TLB misses. This greatly impacts performance (32% average performance degradation relative to an ideal MMU). To mitigate this overhead, we propose a software-agnostic, practical, GPU virtual cache hierarchy. We use the virtual cache hierarchy as an effective address translation bandwidth filter. We observe many requests that miss in private TLBs find corresponding valid data in the GPU cache hierarchy. With a GPU virtual cache hierarchy, these TLB misses can be filtered (i.e., virtual cache hits), significantly reducing bandwidth demands for the shared address translation hardware. In addition, accelerator-specific attributes (e.g., less likelihood of synonyms) of GPUs reduce the design complexity of virtual caches, making a whole virtual cache hierarchy (including a shared L2 cache) practical for GPUs. Our evaluation shows that the entire GPU virtual cache hierarchy effectively filters the high address translation bandwidth, achieving almost the same performance as an ideal MMU. We also evaluate L1-only virtual cache designs and show that using a whole virtual cache hierarchy obtains additional performance benefits (1.31× speedup on average).
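The filtering idea in the abstract can be illustrated with a toy model. The sketch below is not the authors' design or simulator; it is a minimal illustration that assumes a single private TLB and a single virtually addressed cache, both with LRU replacement. The page size, line size, capacities, and the names `LRUSet` and `count_shared_translations` are all hypothetical choices made for this example. It counts how many requests would reach the shared translation hardware when every private TLB miss is forwarded (baseline) versus when only the TLB misses that also miss in the virtual cache are forwarded (filtered).

```python
from collections import OrderedDict

PAGE_SIZE = 4096   # bytes per page (assumed for this sketch)
LINE_SIZE = 128    # bytes per cache line (assumed for this sketch)


class LRUSet:
    """Fixed-capacity set with least-recently-used eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()

    def access(self, key):
        """Return True on a hit; on a miss, insert key and evict the LRU entry if full."""
        if key in self.entries:
            self.entries.move_to_end(key)
            return True
        self.entries[key] = None
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)
        return False


def count_shared_translations(addresses, tlb_entries, cache_lines):
    """Count requests that reach the shared translation hardware.

    Returns (baseline, filtered):
      baseline - every private TLB miss is forwarded;
      filtered - only TLB misses that also miss in the virtual cache are forwarded.
    """
    tlb = LRUSet(tlb_entries)
    vcache = LRUSet(cache_lines)
    baseline = filtered = 0
    for vaddr in addresses:
        tlb_hit = tlb.access(vaddr // PAGE_SIZE)      # private TLB lookup by virtual page
        cache_hit = vcache.access(vaddr // LINE_SIZE)  # virtual cache lookup by virtual line
        if not tlb_hit:
            baseline += 1
            if not cache_hit:
                filtered += 1
    return baseline, filtered


# Hypothetical trace: one line in each of 8 pages, visited three times.
# The 4-entry TLB thrashes, but the 8 lines stay resident in the cache.
trace = [page * PAGE_SIZE for page in range(8)] * 3
print(count_shared_translations(trace, tlb_entries=4, cache_lines=64))  # prints (24, 8)
```

In this trace the working set spans more pages than the TLB can hold but few cache lines, so after the first pass every TLB miss is a virtual cache hit and needs no translation. Real workloads and hardware will behave differently; the model only shows why cached data can outlive its TLB entry.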

Index Terms

Filtering Translation Bandwidth with Virtual Caching
1. Computer systems organization
  1. Architectures
    1. Other architectures
      1. Heterogeneous (hybrid) systems
2. Software and its engineering
  1. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        1. Memory management
          1. Virtual memory

Published in
ASPLOS '18: Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems
March 2018
827 pages
ISBN: 9781450349116
DOI: 10.1145/3173162
General Chairs: Xipeng Shen (North Carolina State University, USA), James Tuck (North Carolina State University, USA)
Program Chairs: Ricardo Bianchini (Microsoft Research, USA), Vivek Sarkar (Georgia Institute of Technology, USA)
ACM SIGPLAN Notices Volume 53, Issue 2
ASPLOS '18
February 2018
809 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/3296957
Editor: Matthew Fluet (Rochester Institute of Technology)
Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
TLB
address translation
heterogeneous computing
virtual caching
virtual memory
Qualifiers
- research-article
Acceptance Rates
ASPLOS '18 Paper Acceptance Rate: 56 of 319 submissions, 18%
Overall Acceptance Rate: 535 of 2,713 submissions, 20%