Abstract
Shared virtual memory is key in heterogeneous systems on chip (SoCs) that combine a general-purpose host processor with a many-core accelerator, both for programmability and performance. In contrast to the full-blown, hardware-only solutions predominant in modern high-end systems, lightweight hardware-software co-designs are better suited to power- and area-constrained embedded systems and provide additional benefits in terms of flexibility and predictability. As a downside, such co-designs require the host to handle page misses in software, including the synchronization that each miss entails, which may incur considerable run-time overheads.
In this work, we present a novel hardware-software virtual memory management approach for many-core accelerators in heterogeneous embedded SoCs. It exploits an accelerator-side helper thread concept that enables the accelerator to manage its virtual memory hardware autonomously while operating cache-coherently on the page tables of the user-space processes of the host. This greatly reduces overhead with respect to host-side solutions while retaining flexibility. We have validated the design with a set of parameterizable benchmarks and real-world applications covering various application domains. For purely memory-bound kernels, the accelerator performance improves by a factor of 3.8 compared with host-based management and lies within 50% of a lower-bound ideal memory management unit.
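The helper thread concept can be illustrated with a small software model. The following is a minimal sketch under stated assumptions, not the design evaluated in this work: the 10/10/12 address split, the dictionary-based two-level page table, and the names `ToyTlb`, `walk`, and `helper_thread_step` are all hypothetical. It only shows the control flow in which a miss is queued on the accelerator side and resolved by a helper that walks the page table itself instead of calling back to the host.

```python
from collections import deque

# Assumed layout (hypothetical): 32-bit virtual addresses, 4 KiB pages,
# two-level page table with a 10/10/12 bit split.
PAGE_SHIFT = 12
L2_BITS = 10
L1_SHIFT = PAGE_SHIFT + L2_BITS
L2_MASK = (1 << L2_BITS) - 1
PAGE_MASK = (1 << PAGE_SHIFT) - 1


def walk(page_table, vaddr):
    """Walk the toy two-level table; return the frame number or None if unmapped."""
    l1_idx = vaddr >> L1_SHIFT
    l2_idx = (vaddr >> PAGE_SHIFT) & L2_MASK
    l2_table = page_table.get(l1_idx)
    if l2_table is None:
        return None
    return l2_table.get(l2_idx)


class ToyTlb:
    """Software-managed TLB model that records misses for a helper to resolve."""

    def __init__(self):
        self.entries = {}          # virtual page number -> physical frame number
        self.miss_queue = deque()  # virtual addresses that missed

    def translate(self, vaddr):
        """Return the physical address on a hit, or None after queuing a miss."""
        pfn = self.entries.get(vaddr >> PAGE_SHIFT)
        if pfn is None:
            self.miss_queue.append(vaddr)
            return None
        return (pfn << PAGE_SHIFT) | (vaddr & PAGE_MASK)


def helper_thread_step(tlb, page_table):
    """Drain the miss queue on the accelerator side; return the number of entries installed."""
    installed = 0
    while tlb.miss_queue:
        vaddr = tlb.miss_queue.popleft()
        pfn = walk(page_table, vaddr)
        if pfn is not None:  # unmapped pages would need the host; not modeled here
            tlb.entries[vaddr >> PAGE_SHIFT] = pfn
            installed += 1
    return installed


# Usage: one mapped page, first access misses, helper resolves it, retry hits.
page_table = {1: {2: 0x80}}
tlb = ToyTlb()
vaddr = (1 << L1_SHIFT) | (2 << PAGE_SHIFT) | 0x34
first = tlb.translate(vaddr)       # miss, queued for the helper
helper_thread_step(tlb, page_table)
second = tlb.translate(vaddr)      # hit after the helper installed the entry
```

In this model the host is never involved in resolving a miss for a mapped page, which is the source of the overhead reduction the abstract describes. Cache coherence with the host's page tables and the handling of unmapped pages are outside the scope of the sketch.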
Index Terms
- Efficient Virtual Memory Sharing via On-Accelerator Page Table Walking in Heterogeneous Embedded SoCs