DOI: 10.1145/2749469.2749471

Redundant memory mappings for fast access to large memories

Published: 13 June 2015

Abstract

Page-based virtual memory improves programmer productivity, security, and memory utilization, but incurs performance overheads due to costly page table walks after TLB misses. This overhead can reach 50% for modern workloads that access increasingly vast memory with stagnating TLB sizes.
To reduce the overhead of virtual memory, this paper proposes Redundant Memory Mappings (RMM), which leverages ranges of pages and provides an efficient, alternative representation of many virtual-to-physical mappings. We define a range to be a subset of a process's pages that are virtually and physically contiguous. RMM translates each range with a single range table entry, enabling a modest number of entries to translate most of the process's address space. RMM operates in parallel with standard paging and uses a software range table and a hardware range TLB with arbitrarily large reach. We modify the operating system to automatically detect ranges and to increase their likelihood with eager page allocation. RMM is thus transparent to applications.
We prototype RMM software in Linux and emulate the hardware. RMM performs substantially better than paging alone and huge pages, and improves a wider variety of workloads than direct segments (one range per program), reducing the overhead of virtual memory to less than 1% on average.
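
The range idea in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation. The entry layout (`base_vpn`, `limit_vpn`, `offset`), the 4 KiB page size, and the function names `detect_ranges` and `translate` are assumptions made for this example. The sketch coalesces per-page mappings into ranges that are both virtually and physically contiguous, then translates an address with a single range entry.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional

PAGE_SIZE = 4096  # assumption: 4 KiB base pages


class Range(NamedTuple):
    """Hypothetical range table entry (field names are illustrative)."""
    base_vpn: int   # first virtual page number covered by the range
    limit_vpn: int  # last virtual page number covered (inclusive)
    offset: int     # physical page number minus virtual page number


def detect_ranges(page_table: dict[int, int]) -> list[Range]:
    """Coalesce vpn -> pfn mappings into maximal contiguous ranges.

    Two neighbouring pages join the same range only if they are adjacent
    virtually (vpn + 1) and share the same vpn-to-pfn offset, which makes
    them adjacent physically as well.
    """
    ranges: list[Range] = []
    for vpn in sorted(page_table):
        offset = page_table[vpn] - vpn
        last = ranges[-1] if ranges else None
        if last and last.limit_vpn + 1 == vpn and last.offset == offset:
            ranges[-1] = last._replace(limit_vpn=vpn)  # extend current range
        else:
            ranges.append(Range(vpn, vpn, offset))     # start a new range
    return ranges


def translate(ranges: list[Range], vaddr: int) -> Optional[int]:
    """Translate a virtual address using the sorted range list.

    Returns None when no range covers the address; in RMM the regular
    page table would still translate such an address.
    """
    vpn, page_off = divmod(vaddr, PAGE_SIZE)
    bases = [r.base_vpn for r in ranges]
    i = bisect_right(bases, vpn) - 1  # last range starting at or before vpn
    if i >= 0 and vpn <= ranges[i].limit_vpn:
        return (vpn + ranges[i].offset) * PAGE_SIZE + page_off
    return None


# Example: five page mappings collapse into two range entries.
page_table = {0x10: 0x100, 0x11: 0x101, 0x12: 0x102, 0x20: 0x50, 0x21: 0x51}
ranges = detect_ranges(page_table)
paddr = translate(ranges, 0x11 * PAGE_SIZE + 5)
```

In this example, one entry covers virtual pages 0x10 through 0x12 and a second covers 0x20 through 0x21, so a lookup needs far fewer entries than one per page. A real range TLB would perform the base and limit comparison in hardware rather than with a binary search.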

      Published In

      ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture
      June 2015
      768 pages
      ISBN:9781450334020
      DOI:10.1145/2749469
      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • Research-article

      Conference

      ISCA '15
      Acceptance Rates

      Overall Acceptance Rate 543 of 3,203 submissions, 17%
