Redundant memory mappings for fast access to large memories

Authors: Vasileios Karakostas, Jayneel Gandhi, Adrián Cristal, Kathryn S. McKinley, Mario Nemirovsky, Michael M. Swift, Osman Ünsal

ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture

Pages 66 - 78

https://doi.org/10.1145/2749469.2749471

Published: 13 June 2015

Abstract

Page-based virtual memory improves programmer productivity, security, and memory utilization, but incurs performance overheads due to costly page table walks after TLB misses. This overhead can reach 50% for modern workloads that access increasingly vast memory with stagnating TLB sizes.

To reduce the overhead of virtual memory, this paper proposes Redundant Memory Mappings (RMM), which leverages ranges of pages and provides an efficient, alternative representation of many virtual-to-physical mappings. We define a range to be a subset of a process's pages that are virtually and physically contiguous. RMM translates each range with a single range table entry, enabling a modest number of entries to translate most of the process's address space. RMM operates in parallel with standard paging and uses a software range table and a hardware range TLB with arbitrarily large reach. We modify the operating system to automatically detect ranges and to increase their likelihood with eager page allocation. RMM is thus transparent to applications.
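As a rough illustration of range translation, the sketch below models a range as a base, a limit, and a constant virtual-to-physical offset, so that one entry translates every address inside the range. All names here (`RangeEntry`, `RangeTable`, `translate`), the 4 KiB page size, the sorted-list lookup, and the dictionary page table are assumptions made for illustration only; the abstract does not specify the range table's layout or the hardware lookup. The sketch also checks the range table first and then falls back to paging sequentially, whereas the abstract describes the two mechanisms as operating in parallel.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Dict, List, Optional

PAGE_SIZE = 4096  # assumption: 4 KiB base pages


@dataclass(frozen=True)
class RangeEntry:
    """Hypothetical range table entry covering [base, limit)."""
    base: int    # first virtual address in the range (page aligned)
    limit: int   # one past the last virtual address in the range
    offset: int  # physical minus virtual address, constant over the range


class RangeTable:
    """Hypothetical software range table, kept sorted by base address."""

    def __init__(self, entries: List[RangeEntry]) -> None:
        self.entries = sorted(entries, key=lambda e: e.base)
        self.bases = [e.base for e in self.entries]

    def lookup(self, vaddr: int) -> Optional[RangeEntry]:
        # Find the last range whose base is <= vaddr, then check its limit.
        i = bisect_right(self.bases, vaddr) - 1
        if i >= 0 and vaddr < self.entries[i].limit:
            return self.entries[i]
        return None


def translate(vaddr: int, ranges: RangeTable, page_table: Dict[int, int]) -> int:
    """Translate via a range entry if one covers vaddr, else via the page table."""
    entry = ranges.lookup(vaddr)
    if entry is not None:
        # A single entry translates any address inside the range.
        return vaddr + entry.offset
    # Fallback: per-page translation (virtual page number -> physical frame number).
    vpn, page_offset = divmod(vaddr, PAGE_SIZE)
    if vpn not in page_table:
        raise LookupError(f"no mapping for virtual address {vaddr:#x}")
    return page_table[vpn] * PAGE_SIZE + page_offset


# Usage: one entry covers 256 contiguous pages (1 MiB) of the address space.
ranges = RangeTable([RangeEntry(0x10000, 0x10000 + 256 * PAGE_SIZE, 0x400000 - 0x10000)])
page_table = {1: 7}  # a single page mapped outside any range
in_range = translate(0x10000 + 5 * PAGE_SIZE + 12, ranges, page_table)
paged = translate(PAGE_SIZE + 3, ranges, page_table)
```

In this toy model, the one range entry stands in for 256 separate page table entries, which is the sense in which a modest number of entries can cover most of an address space.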

We prototype RMM software in Linux and emulate the hardware. RMM performs substantially better than paging alone and huge pages, and improves a wider variety of workloads than direct segments (one range per program), reducing the overhead of virtual memory to less than 1% on average.

Index Terms

Redundant memory mappings for fast access to large memories
1. Hardware
  1. Hardware validation
  2. Integrated circuits
    1. Semiconductor memory

Published In

ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture
June 2015, 768 pages
ISBN: 9781450334020
DOI: 10.1145/2749469
General Chair: Debbie Marr (Intel)
Program Chair: David Albonesi (Cornell)

ACM SIGARCH Computer Architecture News, Volume 43, Issue 3S (ISCA'15)
June 2015, 745 pages
ISSN: 0163-5964
DOI: 10.1145/2872887
Editor: Doug DeGroot

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Sponsors

IEEE TCCA: IEEE Computer Society Technical Committee on Computer Architecture
SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

ISCA '15: The 42nd Annual International Symposium on Computer Architecture

June 13 - 17, 2015

Portland, Oregon

Acceptance Rates

Overall Acceptance Rate 543 of 3,203 submissions, 17%
