Abstract
The explosion of digital data and the ever-growing need for fast data analysis have made in-memory big-data processing in computer systems increasingly important. In particular, large-scale graph processing is gaining attention due to its broad applicability from social science to machine learning. However, scalable hardware design that can efficiently process large graphs in main memory is still an open problem. Ideally, cost-effective and scalable graph processing systems can be realized by building a system whose performance increases proportionally with the sizes of graphs that can be stored in the system, which is extremely challenging in conventional systems due to severe memory bandwidth limitations.
In this work, we argue that the conventional concept of processing-in-memory (PIM) can be a viable solution to achieve such an objective. The key modern enabler for PIM is the recent advancement of the 3D integration technology that facilitates stacking logic and memory dies in a single package, which was not available when the PIM concept was originally examined. In order to take advantage of such a new technology to enable memory-capacity-proportional performance, we design a programmable PIM accelerator for large-scale graph processing called Tesseract. Tesseract is composed of (1) a new hardware architecture that fully utilizes the available memory bandwidth, (2) an efficient method of communication between different memory partitions, and (3) a programming interface that reflects and exploits the unique hardware design. It also includes two hardware prefetchers specialized for memory access patterns of graph processing, which operate based on the hints provided by our programming model. Our comprehensive evaluations using five state-of-the-art graph processing workloads with large real-world graphs show that the proposed architecture improves average system performance by a factor of ten and achieves 87% average energy reduction over conventional systems.
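The abstract describes a programming interface built around efficient communication between memory partitions, where computation is sent to the partition that owns the data rather than pulling the data across the memory network. A minimal sketch of that style of vertex-centric, message-passing programming model is below; it runs one PageRank iteration over a graph split across partitions. All names here (`Partition`, `put`, `pagerank_iteration`) are illustrative assumptions for this sketch, not Tesseract's actual API.

```python
# Sketch of a partitioned, message-passing graph-processing model in the
# spirit of the abstract. Names and structure are hypothetical.

class Partition:
    """One memory partition owning a slice of the vertices."""
    def __init__(self, pid):
        self.pid = pid
        self.vertices = {}   # vid -> list of out-neighbors (global ids)
        self.rank = {}       # vid -> current PageRank value
        self.inbox = []      # queued remote updates (non-blocking 'put's)

def owner(vid, nparts):
    # Simple modulo placement of vertices onto partitions.
    return vid % nparts

def put(partitions, dst_vid, value, nparts):
    # Non-blocking remote update: queue the work at the partition that
    # owns dst_vid instead of fetching its data to the caller.
    partitions[owner(dst_vid, nparts)].inbox.append((dst_vid, value))

def pagerank_iteration(partitions, damping=0.85):
    nparts = len(partitions)
    n = sum(len(p.vertices) for p in partitions)
    # Scatter phase: each partition streams over its own edges and sends
    # rank contributions to the owners of the destination vertices.
    for p in partitions:
        for vid, nbrs in p.vertices.items():
            if nbrs:
                share = p.rank[vid] / len(nbrs)
                for w in nbrs:
                    put(partitions, w, share, nparts)
    # Barrier + apply phase: drain inboxes, then update ranks locally.
    for p in partitions:
        accum = {vid: 0.0 for vid in p.vertices}
        for vid, value in p.inbox:
            accum[vid] += value
        p.inbox.clear()
        for vid in p.vertices:
            p.rank[vid] = (1 - damping) / n + damping * accum[vid]

# Demo: a 4-vertex cycle split across 2 partitions (no dangling vertices,
# so the ranks stay a proper distribution summing to 1).
nparts = 2
parts = [Partition(i) for i in range(nparts)]
edges = {0: [1], 1: [2], 2: [3], 3: [0]}
for vid, nbrs in edges.items():
    p = parts[owner(vid, nparts)]
    p.vertices[vid] = nbrs
    p.rank[vid] = 1.0 / len(edges)
pagerank_iteration(parts)
total = sum(r for p in parts for r in p.rank.values())
```

The design point this sketch illustrates is the one the abstract emphasizes: each partition touches only its locally stored vertices and edges, so aggregate compute and bandwidth scale with the number of partitions, while cross-partition edges turn into queued one-way messages rather than remote loads.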
A scalable processing-in-memory accelerator for parallel graph processing. In ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture.