Efficient Virtual Memory Sharing via On-Accelerator Page Table Walking in Heterogeneous Embedded SoCs

Authors:
Pirmin Vogel
Integrated Systems Laboratory, ETH Zurich, Zurich, Switzerland

Andreas Kurth
Integrated Systems Laboratory, ETH Zurich, Zurich, Switzerland

Johannes Weinbuch
Integrated Systems Laboratory, ETH Zurich, Zurich, Switzerland

Andrea Marongiu
Integrated Systems Laboratory, ETH Zurich and Electrical, Electronic, and Information Engineering Department, University of Bologna, Bologna, Italy

Luca Benini
Integrated Systems Laboratory, ETH Zurich and Electrical, Electronic, and Information Engineering Department, University of Bologna, Bologna, Italy

ACM Transactions on Embedded Computing Systems, Volume 16, Issue 5s, Article No. 154, pp. 1–19, https://doi.org/10.1145/3126560

Published: 27 September 2017

Abstract

Shared virtual memory is key in heterogeneous systems on chip (SoCs) that combine a general-purpose host processor with a many-core accelerator, both for programmability and performance. In contrast to the full-blown, hardware-only solutions predominant in modern high-end systems, lightweight hardware-software co-designs are better suited to more power- and area-constrained embedded systems and provide additional benefits in terms of flexibility and predictability. As a downside, the latter solutions require the host to handle, in software, both synchronization on page misses and the miss handling itself, which may incur considerable run-time overheads.

In this work, we present a novel hardware-software virtual memory management approach for many-core accelerators in heterogeneous embedded SoCs. It exploits an accelerator-side helper thread concept that enables the accelerator to manage its virtual memory hardware autonomously while operating cache-coherently on the page tables of the user-space processes of the host. This greatly reduces overhead with respect to host-side solutions while retaining flexibility. We have validated the design with a set of parameterizable benchmarks and real-world applications covering various application domains. For purely memory-bound kernels, the accelerator performance improves by a factor of 3.8 compared with host-based management and lies within 50% of a lower-bound ideal memory management unit.
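The idea of resolving translation misses on the accelerator itself, by walking the host's page tables, can be illustrated with a small simulation. The sketch below is not the authors' implementation: the two-level table layout (10-bit indices, 4 KiB pages), the FIFO replacement policy, and all names (`walk`, `SoftTLB`, `PageFault`, `demo`) are assumptions made for illustration, and the page table is modeled as nested Python dictionaries rather than as memory read coherently from the host.

```python
PAGE_SHIFT = 12                      # 4 KiB pages (assumed)
LEVEL_BITS = 10                      # 10-bit index per level (assumed generic 32-bit layout)
LEVEL_MASK = (1 << LEVEL_BITS) - 1
PAGE_MASK = (1 << PAGE_SHIFT) - 1


class PageFault(Exception):
    """No valid mapping exists; a real system would have to involve the host here."""


def walk(root, vaddr):
    """Walk a two-level table and return the physical frame number for vaddr."""
    l1_index = (vaddr >> (PAGE_SHIFT + LEVEL_BITS)) & LEVEL_MASK
    l2_index = (vaddr >> PAGE_SHIFT) & LEVEL_MASK
    l2_table = root.get(l1_index)
    if l2_table is None or l2_index not in l2_table:
        raise PageFault(hex(vaddr))
    return l2_table[l2_index]


class SoftTLB:
    """Software-managed TLB whose misses are resolved locally by walk()."""

    def __init__(self, root, capacity=4):
        self.root = root
        self.capacity = capacity
        self.entries = {}            # virtual page number -> physical frame number
        self.misses = 0

    def translate(self, vaddr):
        vpn = vaddr >> PAGE_SHIFT
        if vpn not in self.entries:
            self.misses += 1
            pfn = walk(self.root, vaddr)     # miss handled without the host
            if len(self.entries) >= self.capacity:
                # Evict the oldest entry (FIFO); dicts keep insertion order.
                self.entries.pop(next(iter(self.entries)))
            self.entries[vpn] = pfn
        return (self.entries[vpn] << PAGE_SHIFT) | (vaddr & PAGE_MASK)


def demo():
    # Hypothetical mapping: L1 index 1, L2 indices 2 and 3 map to frames 0x80 and 0x81.
    root = {1: {2: 0x80, 3: 0x81}}
    tlb = SoftTLB(root)
    vaddr = (1 << 22) | (2 << 12) | 0x34
    paddr = tlb.translate(vaddr)     # first access misses and triggers a walk
    tlb.translate(vaddr + 4)         # same page: served from the TLB
    return paddr, tlb.misses
```

In this model, a second access to the same page is served from the TLB without another walk, which is the case a host-side miss handler would otherwise pay for with an interrupt and a synchronization round trip.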

Index Terms

1. Computer systems organization
  1. Architectures
    1. Other architectures
      1. Heterogeneous (hybrid) systems
  2. Embedded and cyber-physical systems
    1. Embedded systems
      1. Embedded software
    2. System on a chip
2. Software and its engineering
  1. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        Memory management
        Main memory
        Virtual memory

Published in
ACM Transactions on Embedded Computing Systems Volume 16, Issue 5s
Special Issue ESWEEK 2017, CASES 2017, CODES + ISSS 2017 and EMSOFT 2017
October 2017
1448 pages
ISSN:1539-9087
EISSN:1558-3465
DOI:10.1145/3145508
Editor:
Sandeep K. Shukla
Indian Institute of Technology, India
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States

Journal Family
ACM Journals for the Design of Smart and Connected Systems
Publication History
- Published: 27 September 2017
- Revised: 1 June 2017
- Accepted: 1 June 2017
- Received: 1 April 2017

Author Tags
Linux
Shared virtual memory
TLB management
embedded systems
heterogeneous SoCs
Qualifiers
- research-article
- Research
- Refereed