Efficient Virtual Memory Sharing via On-Accelerator Page Table Walking in Heterogeneous Embedded SoCs

Published: 27 September 2017
Abstract

Shared virtual memory is key to both programmability and performance in heterogeneous systems on chip (SoCs) that combine a general-purpose host processor with a many-core accelerator. In contrast to the full-blown, hardware-only solutions predominant in modern high-end systems, lightweight hardware-software co-designs are better suited to more power- and area-constrained embedded systems and provide additional benefits in terms of flexibility and predictability. As a downside, the latter solutions require the host to handle page misses in software, including the synchronization that each miss entails. This may incur considerable run-time overheads.

In this work, we present a novel hardware-software virtual memory management approach for many-core accelerators in heterogeneous embedded SoCs. It exploits an accelerator-side helper thread concept that enables the accelerator to manage its virtual memory hardware autonomously while operating cache-coherently on the page tables of the user-space processes of the host. This greatly reduces overhead with respect to host-side solutions while retaining flexibility. We have validated the design with a set of parameterizable benchmarks and real-world applications covering various application domains. For purely memory-bound kernels, the accelerator performance improves by a factor of 3.8 compared with host-based management and lies within 50% of a lower-bound ideal memory management unit.
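
The abstract gives no implementation detail, so the following is only a toy software model of the general idea: worker code looks up a software-managed TLB, misses are queued, and an accelerator-side helper routine walks the page table and refills the TLB without involving the host. The two-level table layout (4 KiB pages, 256-entry second-level tables), the FIFO replacement policy, and all names (`walk`, `SoftwareManagedTLB`, `helper_thread_step`) are assumptions made for illustration and are not taken from the paper.

```python
from collections import OrderedDict, deque

PAGE_SHIFT = 12                      # assumed: 4 KiB pages
L2_BITS = 8                          # assumed: 256 entries per second-level table
L2_MASK = (1 << L2_BITS) - 1
OFFSET_MASK = (1 << PAGE_SHIFT) - 1


def walk(page_table, vaddr):
    """Two-level walk; returns the physical frame number, or None if unmapped."""
    vpn = vaddr >> PAGE_SHIFT
    l1_index = vpn >> L2_BITS
    l2_index = vpn & L2_MASK
    l2_table = page_table.get(l1_index)
    if l2_table is None:
        return None
    return l2_table.get(l2_index)


class SoftwareManagedTLB:
    """Small fully associative TLB with FIFO replacement (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # virtual page number -> physical frame number
        self.miss_queue = deque()      # virtual page numbers awaiting the helper

    def translate(self, vaddr):
        vpn = vaddr >> PAGE_SHIFT
        pfn = self.entries.get(vpn)
        if pfn is None:
            self.miss_queue.append(vpn)    # a real worker would stall or sleep here
            return None
        return (pfn << PAGE_SHIFT) | (vaddr & OFFSET_MASK)

    def insert(self, vpn, pfn):
        if vpn not in self.entries and len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)   # evict the oldest entry
        self.entries[vpn] = pfn


def helper_thread_step(tlb, page_table):
    """Drain the miss queue; return the pages the walk could not resolve."""
    unresolved = []
    while tlb.miss_queue:
        vpn = tlb.miss_queue.popleft()
        pfn = walk(page_table, vpn << PAGE_SHIFT)
        if pfn is None:
            unresolved.append(vpn)
        else:
            tlb.insert(vpn, pfn)
    return unresolved


# Usage: one mapped page; the first access misses, the helper refills, the retry hits.
page_table = {0x001: {0x23: 0xABCDE}}
tlb = SoftwareManagedTLB(capacity=2)
first_try = tlb.translate(0x00123456)          # None: TLB miss, page queued
pending = helper_thread_step(tlb, page_table)  # []: the walk resolved the miss
second_try = tlb.translate(0x00123456)         # 0xABCDE456
```

In this model the host never appears on the miss path; only pages returned in `pending` would need further handling, which the sketch leaves out.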

        • Published in

          ACM Transactions on Embedded Computing Systems, Volume 16, Issue 5s
          Special Issue ESWEEK 2017, CASES 2017, CODES + ISSS 2017 and EMSOFT 2017
          October 2017
          1448 pages
          ISSN:1539-9087
          EISSN:1558-3465
          DOI:10.1145/3145508
          Copyright © 2017 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 27 September 2017
          • Revised: 1 June 2017
          • Accepted: 1 June 2017
          • Received: 1 April 2017

          Qualifiers

          • research-article
          • Research
          • Refereed
