TokenTLB: A Token-Based Page Classification Approach

Authors:

Antonio Robles,

María E. Gómez,

José Duato

ICS '16: Proceedings of the 2016 International Conference on Supercomputing

Article No.: 26, Pages 1 - 13

https://doi.org/10.1145/2925426.2926280

Published: 01 June 2016

Abstract

Classifying memory accesses as private or shared has become a fundamental approach to achieving efficiency and scalability in multi- and many-core systems. Most memory accesses in both sequential and parallel applications target data that is either private (accessed by only one core) or read-only (never written), so paying the full cost of coherence on every memory access yields sub-optimal performance and limits the scalability and efficiency of the multiprocessor.

This work proposes TokenTLB, a page classification approach based on exchanging and counting tokens. The key observation behind our proposal is that, unlike coherence management, data classification obtains all the benefits of a token-based approach without the burden of complex arbitration mechanisms, which have discouraged the implementation of token-based coherence protocols in commodity systems. Token counting on TLBs is a natural and efficient way to classify memory pages. It does not require complex and undesirable persistent requests or arbitration: when two or more TLBs race to access a page, the tokens are distributed among them and the page is classified as shared. TokenTLB also favors the sharing of translation information among TLBs, which improves system performance and constrains much of the TLB traffic compared to other broadcast-based approaches. This is achieved by requiring that only TLBs holding extra tokens provide them along with the page translation (about one response per TLB miss). TokenTLB increases the fraction of blocks classified as private to up to 61.1% while also allowing read-only detection (24.4% shared-read-only blocks). When TokenTLB is applied to optimize the directory, it reduces the dynamic energy consumed by the cache hierarchy by nearly 27.3% over the baseline.
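
The token-counting idea can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the protocol from the paper: the abstract does not say how many tokens a page has, where unheld tokens live, or how many tokens a responder hands over. Here each page is assumed to own one token per TLB, unheld tokens are parked in a hypothetical `home` store, and a responding TLB keeps one token and gives away the rest. The names `TokenTlbModel`, `miss`, `evict`, and `classify` are made up for illustration.

```python
class TokenTlbModel:
    """Toy model: each page owns num_tlbs tokens (an assumption of this sketch)."""

    def __init__(self, num_tlbs=4):
        self.num_tlbs = num_tlbs
        self.held = [dict() for _ in range(num_tlbs)]  # per TLB: page -> tokens held
        self.home = {}  # hypothetical store: page -> tokens held by no TLB

    def classify(self, tlb, page):
        tokens = self.held[tlb].get(page, 0)
        if tokens == 0:
            return "absent"
        # Holding every token proves that no other TLB caches the page.
        return "private" if tokens == self.num_tlbs else "shared"

    def miss(self, tlb, page):
        """Serve a miss in `tlb`; return the number of token-carrying responses."""
        # On first touch all tokens are assumed to sit at home.
        gained = self.home.setdefault(page, self.num_tlbs)
        self.home[page] = 0
        responses = 1 if gained else 0
        for other in range(self.num_tlbs):
            extra = self.held[other].get(page, 0) - 1
            if other != tlb and extra > 0:
                # Only TLBs with extra tokens answer; each keeps one token.
                self.held[other][page] = 1
                gained += extra
                responses += 1
        self.held[tlb][page] = self.held[tlb].get(page, 0) + gained
        return responses

    def evict(self, tlb, page):
        """Return the evicted entry's tokens to home so that none are lost."""
        self.home[page] = self.home.get(page, 0) + self.held[tlb].pop(page, 0)

    def total_tokens(self, page):
        return self.home.get(page, 0) + sum(h.get(page, 0) for h in self.held)


model = TokenTlbModel(num_tlbs=4)
page = 0x42
first = model.miss(0, page)               # TLB 0 receives every token
after_first = model.classify(0, page)
second = model.miss(1, page)              # TLB 0 hands its extra tokens to TLB 1
after_second = (model.classify(0, page), model.classify(1, page))
```

In this model a requester always obtains at least one token: with as many tokens as TLBs, either the home store or some other TLB must hold a spare. Under these assumptions the second miss is answered by a single TLB, which is in the spirit of the "about one response per TLB miss" figure quoted above, although the real mechanism may distribute tokens differently.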

Published In

ICS '16: Proceedings of the 2016 International Conference on Supercomputing

June 2016

547 pages

ISBN:9781450343619

DOI:10.1145/2925426

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

ICS '16

Sponsor: SIGARCH

ICS '16: 2016 International Conference on Supercomputing

June 1 - 3, 2016

Istanbul, Turkey

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Cited By

Upadhyay B, Ros A, M. S (2023). Fine-grain data classification to filter token coherence traffic. Journal of Parallel and Distributed Computing, 171:C, pages 40-53. Online publication date: 1-Jan-2023.
https://dl.acm.org/doi/10.1016/j.jpdc.2022.09.004

Caheny P, Alvarez L, Casas M, Moreto M (2022). TD-NUCA: Runtime Driven Management of NUCA Caches in Task Dataflow Programming Models. SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-15. Online publication date: Nov-2022.
https://doi.org/10.1109/SC41404.2022.00085

Upadhyay B, Ros A, Shah J (2021). Efficient classification of private memory blocks. Journal of Parallel and Distributed Computing. Online publication date: Jul-2021.
https://doi.org/10.1016/j.jpdc.2021.07.005

Upadhyay B, Ros A, NS M (2020). TLB-based Block-Grain Classification of Private Data. 2020 28th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pages 122-130. Online publication date: Mar-2020.
https://doi.org/10.1109/PDP50117.2020.00025

Caheny P, Alvarez L, Valero M, Moretó M, Casas M (2018). Runtime-assisted cache coherence deactivation in task parallel programs. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pages 1-12. Online publication date: 11-Nov-2018.
https://dl.acm.org/doi/10.5555/3291656.3291703

Soltaniyeh M, Kadayif I, Ozturk O (2018). Classifying Data Blocks at Subpage Granularity With an On-Chip Page Table to Improve Coherence in Tiled CMPs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 37:4, pages 806-819. Online publication date: 1-Apr-2018.
https://dl.acm.org/doi/10.1109/TCAD.2017.2729280

Caheny P, Alvarez L, Valero M, Moretó M, Casas M (2018). Runtime-assisted cache coherence deactivation in task parallel programs. Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, pages 1-12. Online publication date: 11-Nov-2018.
https://dl.acm.org/doi/10.1109/SC.2018.00038

Ho N, Ashraf I, Kaufmann P, Platzner M (2017). Accurate private/shared classification of memory accesses. Proceedings of the Conference on Design, Automation & Test in Europe, pages 788-793. Online publication date: 27-Mar-2017.
https://dl.acm.org/doi/10.5555/3130379.3130570

Ho N, Ashraf I, Kaufmann P, Platzner M (2017). Accurate private/shared classification of memory accesses: A run-time analysis system for the LEON3 multi-core processor. Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017, pages 788-793. Online publication date: Mar-2017.
https://doi.org/10.23919/DATE.2017.7927096

Tsai P, Beckmann N, Sanchez D (2017). Nexus: A New Approach to Replication in Distributed Shared Caches. 2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 166-179. Online publication date: Sep-2017.
https://doi.org/10.1109/PACT.2017.42
