Research Article

A scalable processing-in-memory accelerator for parallel graph processing

Published: 13 June 2015

Abstract

The explosion of digital data and the ever-growing need for fast data analysis have made in-memory big-data processing in computer systems increasingly important. In particular, large-scale graph processing is gaining attention due to its broad applicability from social science to machine learning. However, scalable hardware design that can efficiently process large graphs in main memory is still an open problem. Ideally, cost-effective and scalable graph processing systems can be realized by building a system whose performance increases proportionally with the sizes of graphs that can be stored in the system, which is extremely challenging in conventional systems due to severe memory bandwidth limitations.

In this work, we argue that the conventional concept of processing-in-memory (PIM) can be a viable solution to achieve such an objective. The key modern enabler for PIM is the recent advancement of the 3D integration technology that facilitates stacking logic and memory dies in a single package, which was not available when the PIM concept was originally examined. In order to take advantage of such a new technology to enable memory-capacity-proportional performance, we design a programmable PIM accelerator for large-scale graph processing called Tesseract. Tesseract is composed of (1) a new hardware architecture that fully utilizes the available memory bandwidth, (2) an efficient method of communication between different memory partitions, and (3) a programming interface that reflects and exploits the unique hardware design. It also includes two hardware prefetchers specialized for memory access patterns of graph processing, which operate based on the hints provided by our programming model. Our comprehensive evaluations using five state-of-the-art graph processing workloads with large real-world graphs show that the proposed architecture improves average system performance by a factor of ten and achieves 87% average energy reduction over conventional systems.



Published in

ACM SIGARCH Computer Architecture News, Volume 43, Issue 3S
ISCA '15, June 2015, 745 pages
ISSN: 0163-5964
DOI: 10.1145/2872887

ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture
June 2015, 768 pages
ISBN: 978-1-4503-3402-0
DOI: 10.1145/2749469

          Copyright © 2015 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

