DOI:10.1145/2925426.2926280

TokenTLB: A Token-Based Page Classification Approach

Published: 01 June 2016

Abstract

Classifying memory accesses as private or shared has become a fundamental approach to achieving efficiency and scalability in multi- and many-core systems. Most memory accesses in both sequential and parallel applications target data that is either private (accessed by only one core) or read-only (never written), so paying the full cost of coherence on every memory access yields sub-optimal performance and limits the scalability and efficiency of the multiprocessor.
This work proposes TokenTLB, a page classification approach based on exchanging and counting tokens. The key observation behind our proposal is that, unlike coherence management, data classification obtains all the benefits of a token-based approach without the burden of the complex arbitration mechanisms that have discouraged the implementation of token-based coherence protocols in commodity systems. Token counting on TLBs is a natural and efficient way to classify memory pages. It does not require complex and undesirable persistent requests or arbitration: when two or more TLBs race to access a page, the tokens are distributed among them and the page is classified as shared. TokenTLB also favors the sharing of translation information among TLBs, which improves system performance and constrains TLB traffic compared to other broadcast-based approaches. This is achieved by requiring that only TLBs holding extra tokens provide them along with the page translation (about one response per TLB miss). TokenTLB increases the fraction of blocks classified as private to up to 61.1% while also allowing read-only detection (24.4% of blocks classified as shared read-only). When TokenTLB is applied to optimize the directory, it reduces the dynamic energy consumed by the cache hierarchy by nearly 27.3% over the baseline.
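
The token-counting idea in the abstract can be illustrated with a small sketch. It is not the paper's implementation: the abstract does not give the protocol details, so the code assumes one token per TLB for every page, a hypothetical "home" pool that holds tokens not owned by any TLB, and a hypothetical policy in which a TLB holding extra tokens hands over half of them on a miss. The names TokenSystem, access, evict and classify are made up for illustration, and read-only detection is left out.

```python
class TokenSystem:
    """Toy model of token-based page classification (illustrative only)."""

    def __init__(self, num_tlbs):
        # Assumption: every page has exactly num_tlbs tokens in total.
        self.num_tlbs = num_tlbs
        self.tlbs = [dict() for _ in range(num_tlbs)]  # page -> tokens held
        self.home = {}  # page -> tokens currently held by no TLB

    def classify(self, tlb_id, page):
        held = self.tlbs[tlb_id].get(page, 0)
        if held == 0:
            return "absent"
        # Holding every token means no other TLB can hold the page.
        return "private" if held == self.num_tlbs else "shared"

    def access(self, tlb_id, page):
        if page not in self.tlbs[tlb_id]:
            self._miss(tlb_id, page)
        return self.classify(tlb_id, page)

    def _miss(self, tlb_id, page):
        # Take whatever tokens no TLB currently owns (all of them on first use).
        gained = self.home.get(page, self.num_tlbs)
        self.home[page] = 0
        # Only TLBs holding extra tokens (more than one) respond.
        for other_id, other in enumerate(self.tlbs):
            held = other.get(page, 0)
            if other_id != tlb_id and held > 1:
                give = held // 2  # hypothetical split policy
                other[page] = held - give
                gained += give
        self.tlbs[tlb_id][page] = gained

    def evict(self, tlb_id, page):
        # Tokens of an evicted entry return to the home pool.
        returned = self.tlbs[tlb_id].pop(page, 0)
        self.home[page] = self.home.get(page, self.num_tlbs) + returned

    def total_tokens(self, page):
        in_tlbs = sum(t.get(page, 0) for t in self.tlbs)
        return in_tlbs + self.home.get(page, self.num_tlbs)


system = TokenSystem(num_tlbs=4)
first = system.access(0, 0x10)   # TLB 0 receives all four tokens: "private"
second = system.access(1, 0x10)  # TLB 0 hands over two tokens: "shared"
```

Because the total number of tokens per page is fixed, a requesting TLB in this sketch always obtains at least one token without any arbitration step: either the home pool is non-empty or some other TLB holds more than one token. Classification follows directly from the local count, so TLB 0 sees the page as shared as soon as it gives tokens away.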

Published In

ICS '16: Proceedings of the 2016 International Conference on Supercomputing
June 2016
547 pages
ISBN:9781450343619
DOI:10.1145/2925426

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Data classification
  2. Private-shared
  3. Read-only data
  4. TLB
  5. Token counting

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICS '16

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%
