Research Article

A scalable processing-in-memory accelerator for parallel graph processing

Published: 13 June 2015

Abstract

The explosion of digital data and the ever-growing need for fast data analysis have made in-memory big-data processing in computer systems increasingly important. In particular, large-scale graph processing is gaining attention due to its broad applicability from social science to machine learning. However, scalable hardware design that can efficiently process large graphs in main memory is still an open problem. Ideally, cost-effective and scalable graph processing systems can be realized by building a system whose performance increases proportionally with the sizes of graphs that can be stored in the system, which is extremely challenging in conventional systems due to severe memory bandwidth limitations.

In this work, we argue that the conventional concept of processing-in-memory (PIM) can be a viable solution to achieve such an objective. The key modern enabler for PIM is the recent advancement of the 3D integration technology that facilitates stacking logic and memory dies in a single package, which was not available when the PIM concept was originally examined. In order to take advantage of such a new technology to enable memory-capacity-proportional performance, we design a programmable PIM accelerator for large-scale graph processing called Tesseract. Tesseract is composed of (1) a new hardware architecture that fully utilizes the available memory bandwidth, (2) an efficient method of communication between different memory partitions, and (3) a programming interface that reflects and exploits the unique hardware design. It also includes two hardware prefetchers specialized for memory access patterns of graph processing, which operate based on the hints provided by our programming model. Our comprehensive evaluations using five state-of-the-art graph processing workloads with large real-world graphs show that the proposed architecture improves average system performance by a factor of ten and achieves 87% average energy reduction over conventional systems.



Published in

ACM SIGARCH Computer Architecture News, Volume 43, Issue 3S
ISCA '15, June 2015, 745 pages
ISSN: 0163-5964
DOI: 10.1145/2872887

ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture
June 2015, 768 pages
ISBN: 978-1-4503-3402-0
DOI: 10.1145/2749469

          Copyright © 2015 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

