research-article

NUMA-aware graph-structured analytics

Authors:
Kaiyuan Zhang

Shanghai Jiao Tong University, China

Shanghai Jiao Tong University, China
View Profile

,
Rong Chen

Shanghai Jiao Tong University, China

Shanghai Jiao Tong University, China
View Profile

,
Haibo Chen

Shanghai Jiao Tong University, China

Shanghai Jiao Tong University, China
View Profile

Authors Info & Claims

ACM SIGPLAN Notices Volume 50 Issue 8August 2015pp 183–193https://doi.org/10.1145/2858788.2688507

Published:24 January 2015Publication History

ACM SIGPLAN Notices

Abstract

Graph-structured analytics has been widely adopted in a number of big data applications such as social computation, web-search and recommendation systems. Though much prior research focuses on scaling graph-analytics on distributed environments, the strong desire on performance per core, dollar and joule has generated considerable interests of processing large-scale graphs on a single server-class machine, which may have several terabytes of RAM and 80 or more cores. However, prior graph-analytics systems are largely neutral to NUMA characteristics and thus have suboptimal performance. This paper presents a detailed study of NUMA characteristics and their impact on the efficiency of graph-analytics. Our study uncovers two insights: 1) either random or interleaved allocation of graph data will significantly hamper data locality and parallelism; 2) sequential inter-node (i.e., remote) memory accesses have much higher bandwidth than both intra- and inter-node random ones. Based on them, this paper describes Polymer, a NUMA-aware graph-analytics system on multicore with two key design decisions. First, Polymer differentially allocates and places topology data, application-defined data and mutable runtime states of a graph system according to their access patterns to minimize remote accesses. Second, for some remaining random accesses, Polymer carefully converts random remote accesses into sequential remote accesses, by using lightweight replication of vertices across NUMA nodes. To improve load balance and vertex convergence, Polymer is further built with a hierarchical barrier to boost parallelism and locality, an edge-oriented balanced partitioning for skewed graphs, and adaptive data structures according to the proportion of active vertices. A detailed evaluation on an 80-core machine shows that Polymer often outperforms the state-of-the-art single-machine graph-analytics systems, including Ligra, X-Stream and Galois, for a set of popular real-world and synthetic graphs.

References

The 9th dimacs implementation challenge - shortest paths. http://www.dis.uniroma1.it/challenge9/.Google Scholar
Graph 500. http://www.graph500.org.Google Scholar
numactl. http://oss.sgi.com/projects/libnuma/.Google Scholar
L. A. Adamic and B. A. Huberman. Zipf’s law and the internet. Glottometrics, 3(1):143–150, 2002.Google Scholar
A. Baumann, P. Barham, P.-E. Dagand, T. Harris, R. Isaacs, S. Peter, T. Roscoe, A. Schüpbach, and A. Singhania. The multikernel: a new os architecture for scalable multicore systems. In SOSP, 2009. Google ScholarDigital Library
S. Beamer, K. Asanovi´c, and D. Patterson. Direction-optimizing breadth-first search. Scientific Programming, 21(3):137–148, 2013. Google ScholarDigital Library
S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In WWW, pages 107–117, 1998. Google ScholarDigital Library
I. Calciu, D. Dice, Y. Lev, V. Luchangco, V. J. Marathe, and N. Shavit. Numa-aware reader-writer locks. In PPoPP, 2013. Google ScholarDigital Library
D. Chakrabarti, Y. Zhan, and C. Faloutsos. R-mat: A recursive model for graph mining. In SDM, volume 4, pages 442–446. SIAM, 2004.Google Scholar
R. Chen and H. Chen. Tiled-mapreduce: Efficient and flexible mapreduce processing on multicore with tiling. ACM TACO, 10, 2013. Google ScholarDigital Library
R. Chen, H. Chen, and B. Zang. Tiled-mapreduce: optimizing resource usages of data-parallel applications on multicore with tiling. In PACT, 2010. Google ScholarDigital Library
R. Chen, J. Shi, Y. Chen, H. Guan, and H. Chen. Powerlyra: Differentiated graph computation and partitioning on skewed graphs. Technical Report 2013-11-001, IPADS, SJTU, 2013.Google Scholar
R. Chen, X. Ding, P. Wang, H. Chen, B. Zang, and H. Guan. Computation and communication efficient graph processing with distributed immutable view. In HPDC, 2014. Google ScholarDigital Library
R. Chen, J. Shi, B. Zang, and H. Guan. Bipartite-oriented distributed graph partitioning for big learning. In APSys, 2014. Google ScholarDigital Library
R. Cheng, J. Hong, A. Kyrola, Y. Miao, X. Weng, M. Wu, F. Yang, L. Zhou, F. Zhao, and E. Chen. Kineograph: taking the pulse of a fast-changing and connected world. In EuroSys, pages 85–98, 2012. Google ScholarDigital Library
T. H. Cormen, C. E. Leiserson, R. L. Rivest, C. Stein, et al. Introduction to algorithms, volume 2. MIT press Cambridge, 2001. Google ScholarDigital Library
M. Dashti, A. Fedorova, J. Funston, F. Gaud, R. Lachaize, B. Lepers, V. Quema, and M. Roth. Traffic management: a holistic approach to memory placement on numa systems. In ASPLOS, 2013. Google ScholarDigital Library
T. David, R. Guerraoui, and V. Trigonakis. Everything you always wanted to know about synchronization but were afraid to ask. In SOSP, 2013. Google ScholarDigital Library
D. Dice, V. J. Marathe, and N. Shavit. Lock cohorting: a general technique for designing numa locks. In PPoPP, pages 247–256, 2012. Google ScholarDigital Library
M. Faloutsos, P. Faloutsos, and C. Faloutsos. On power-law relationships of the internet topology. In SIGCOMM, pages 251–262, 1999. Google ScholarDigital Library
F. Gaud, B. Lepers, J. Decouchant, J. Funston, A. Fedorova, V. Quema, and I. Grenoble. Large pages may be harmful on numa systems. In USENIX ATC, 2014. Google ScholarDigital Library
J. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: Distributed graph-parallel computation on natural graphs. In OSDI, 2012. Google ScholarDigital Library
J. E. Gonzalez, R. S. Xin, A. Dave, D. Crankshaw, M. J. Franklin, and I. Stoica. Graphx: Graph processing in a distributed dataflow framework. In OSDI, 2014. Google ScholarDigital Library
W. Han, Y. Miao, K. Li, M. Wu, F. Yang, L. Zhou, V. Prabhakaran, W. Chen, and E. Chen. Chronos: a graph engine for temporal graph analysis. In EuroSys, 2014. Google ScholarDigital Library
U. Kang, D. Horng, et al. Inference of beliefs on billion-scale graphs. In SIGKDD-LDMTA, 2010.Google Scholar
Z. Khayyat, K. Awara, A. Alonazi, H. Jamjoom, D. Williams, and P. Kalnis. Mizan: a system for dynamic load balancing in large-scale graph processing. In EuroSys, pages 169–182, 2013. Google ScholarDigital Library
H. Kwak, C. Lee, H. Park, and S. Moon. What is twitter, a social network or a news media? In WWW, pages 591–600, 2010. Google ScholarDigital Library
A. Kyrola, G. Blelloch, and C. Guestrin. GraphChi: Large-scale graph computation on just a PC. In OSDI, 2012. Google ScholarDigital Library
R. Lachaize, B. Lepers, V. Quéma, et al. Memprof: A memory profiler for numa multicore systems. In USENIX ATC, 2012. Google ScholarDigital Library
X. Liu and J. Mellor-Crummey. A tool to analyze the performance of multithreaded programs on numa architectures. In PPoPP, 2014. Google ScholarDigital Library
Y. Liu, B. Wu, H. Wang, and P. Ma. Bpgm: A big graph mining tool. Tsinghua Science and Technology, 19(1):33–38, 2014.Google ScholarCross Ref
Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein. Distributed GraphLab: a framework for machine learning and data mining in the cloud. VLDB Endowment, 5(8):716–727, 2012. Google ScholarDigital Library
A. Lumsdaine, D. Gregor, B. Hendrickson, and J. Berry. Challenges in parallel graph processing. Parallel Processing Letters, 17(1):5–20, 2007.Google ScholarCross Ref
Z. Majo and T. R. Gross. Memory management in numa multicore systems: trapped between cache contention and interconnect overhead. In ISMM, 2011. Google ScholarDigital Library
G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In SIGMOD, 2010. Google ScholarDigital Library
J. Mellor-Crummey and M. L. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM TOCS, 9(1):21–65, 1991. Google ScholarDigital Library
U. Meyer and P. Sanders. δ -stepping: a parallelizable shortest path algorithm. Journal of Algorithms, 49(1):114–152, 2003. Google ScholarDigital Library
D. G. Murray, F. McSherry, R. Isaacs, M. Isard, P. Barham, and M. Abadi. Naiad: a timely dataflow system. In SOSP, pages 439–455. ACM, 2013. Google ScholarDigital Library
D. Nguyen, A. Lenharth, and K. Pingali. A lightweight infrastructure for graph analytics. In SOSP, 2013. Google ScholarDigital Library
B. Panda, J. Herbach, S. Basu, and R. Bayardo. PLANET: massively parallel learning of tree ensembles with MapReduce. In VLDB, 2009. Google ScholarDigital Library
V. Prabhakaran, M. Wu, X. Weng, F. McSherry, L. Zhou, and M. Haridasan. Managing large graphs on multi-cores with graph awareness. In Usenix ATC, 2012. Google ScholarDigital Library
A. Roy, I. Mihailovic, and W. Zwaenepoel. X-stream: Edge-centric graph processing using streaming partitions. In SOSP, 2013. Google ScholarDigital Library
J. Shun and G. E. Blelloch. Ligra: a lightweight graph processing framework for shared memory. In PPoPP, 2013. Google ScholarDigital Library
A. Smola and S. Narayanamurthy. An architecture for parallel topic models. In VLDB, 2010. Google ScholarDigital Library
P. Wang, K. Zhang, R. Chen, H. Chen, and H. Guan. Replication-based fault-tolerance for large-scale graph processing. In DSN, 2014. Google ScholarDigital Library
C. Xie, R. Chen, H. Guan, B. Zang, and H. Chen. Sync or async: Time to fuse for distributed graph-parallel computation. In PPoPP, 2015. Google ScholarDigital Library
X. Zhao, A. Chang, A. D. Sarma, H. Zheng, and B. Y. Zhao. On the embeddability of random walk distances. In VLDB, 2013. Google ScholarDigital Library
J. Zhong and B. He. Medusa: Simplified Graph Processing on GPUs. TPDS, 2013. Google ScholarDigital Library
X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical report, Technical Report CMUCALD-02-107, Carnegie Mellon University, 2002.Google Scholar

Index Terms

NUMA-aware graph-structured analytics
1. Computing methodologies
  1. Concurrent computing methodologies
    1. Concurrent programming languages
2. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language types
        Concurrent programming languages

Recommendations

NUMA-aware graph-structured analytics
PPoPP 2015: Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

Graph-structured analytics has been widely adopted in a number of big data applications such as social computation, web-search and recommendation systems. Though much prior research focuses on scaling graph-analytics on distributed environments, the ...
Read More
Node-based memory management for scalable NUMA architectures
ROSS '12: Proceedings of the 2nd International Workshop on Runtime and Operating Systems for Supercomputers

Large state-of-the-art NUMA systems may offer more than two levels of node distances. The result is a hierarchical architecture with significant differences in memory access bandwidth and latency. Consequently, NUMA-aware memory management and the ...
Read More
Understanding the performance of storage class memory file systems in the NUMA architecture

Recent developments in storage class memory (SCM) such as PCM, MRAM, resistive RAM (RRAM), and spin-transfer torque (STT)-RAM have strengthened their leadership as storage media for memory-based file systems. Traditional Linux memory-based file systems ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
ACM SIGPLAN Notices Volume 50, Issue 8
PPoPP '15
August 2015
290 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/2858788
Editor:
Andy Gill
University of Kansas, Lawrence, KS
Issue’s Table of Contents
PPoPP 2015: Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
January 2015
290 pages
ISBN:9781450332057
DOI:10.1145/2688500
General Chair:
Albert Cohen
INRIA, France
,
Program Chair:
David Grove
IBM Research, USA
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 24 January 2015
Check for updates
Author Tags
Graph-structured Analytics
Non-uniform Memory Access (NUMA)
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 153
  Total Citations
  View Citations
- 1,472
  Total Downloads
- Downloads (Last 12 months)90
- Downloads (Last 6 weeks)8
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

NUMA-aware graph-structured analytics

ACM SIGPLAN Notices

Abstract

References

Cited By

Index Terms

Recommendations

NUMA-aware graph-structured analytics

Node-based memory management for scalable NUMA architectures

Understanding the performance of storage class memory file systems in the NUMA architecture