Abstract
Graph-structured analytics has been widely adopted in a number of big data applications such as social computation, web-search and recommendation systems. Though much prior research focuses on scaling graph-analytics on distributed environments, the strong desire on performance per core, dollar and joule has generated considerable interests of processing large-scale graphs on a single server-class machine, which may have several terabytes of RAM and 80 or more cores. However, prior graph-analytics systems are largely neutral to NUMA characteristics and thus have suboptimal performance. This paper presents a detailed study of NUMA characteristics and their impact on the efficiency of graph-analytics. Our study uncovers two insights: 1) either random or interleaved allocation of graph data will significantly hamper data locality and parallelism; 2) sequential inter-node (i.e., remote) memory accesses have much higher bandwidth than both intra- and inter-node random ones. Based on them, this paper describes Polymer, a NUMA-aware graph-analytics system on multicore with two key design decisions. First, Polymer differentially allocates and places topology data, application-defined data and mutable runtime states of a graph system according to their access patterns to minimize remote accesses. Second, for some remaining random accesses, Polymer carefully converts random remote accesses into sequential remote accesses, by using lightweight replication of vertices across NUMA nodes. To improve load balance and vertex convergence, Polymer is further built with a hierarchical barrier to boost parallelism and locality, an edge-oriented balanced partitioning for skewed graphs, and adaptive data structures according to the proportion of active vertices. A detailed evaluation on an 80-core machine shows that Polymer often outperforms the state-of-the-art single-machine graph-analytics systems, including Ligra, X-Stream and Galois, for a set of popular real-world and synthetic graphs.
- The 9th dimacs implementation challenge - shortest paths. http://www.dis.uniroma1.it/challenge9/.Google Scholar
- Graph 500. http://www.graph500.org.Google Scholar
- numactl. http://oss.sgi.com/projects/libnuma/.Google Scholar
- L. A. Adamic and B. A. Huberman. Zipf’s law and the internet. Glottometrics, 3(1):143–150, 2002.Google Scholar
- A. Baumann, P. Barham, P.-E. Dagand, T. Harris, R. Isaacs, S. Peter, T. Roscoe, A. Schüpbach, and A. Singhania. The multikernel: a new os architecture for scalable multicore systems. In SOSP, 2009. Google ScholarDigital Library
- S. Beamer, K. Asanovi´c, and D. Patterson. Direction-optimizing breadth-first search. Scientific Programming, 21(3):137–148, 2013. Google ScholarDigital Library
- S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In WWW, pages 107–117, 1998. Google ScholarDigital Library
- I. Calciu, D. Dice, Y. Lev, V. Luchangco, V. J. Marathe, and N. Shavit. Numa-aware reader-writer locks. In PPoPP, 2013. Google ScholarDigital Library
- D. Chakrabarti, Y. Zhan, and C. Faloutsos. R-mat: A recursive model for graph mining. In SDM, volume 4, pages 442–446. SIAM, 2004.Google Scholar
- R. Chen and H. Chen. Tiled-mapreduce: Efficient and flexible mapreduce processing on multicore with tiling. ACM TACO, 10, 2013. Google ScholarDigital Library
- R. Chen, H. Chen, and B. Zang. Tiled-mapreduce: optimizing resource usages of data-parallel applications on multicore with tiling. In PACT, 2010. Google ScholarDigital Library
- R. Chen, J. Shi, Y. Chen, H. Guan, and H. Chen. Powerlyra: Differentiated graph computation and partitioning on skewed graphs. Technical Report 2013-11-001, IPADS, SJTU, 2013.Google Scholar
- R. Chen, X. Ding, P. Wang, H. Chen, B. Zang, and H. Guan. Computation and communication efficient graph processing with distributed immutable view. In HPDC, 2014. Google ScholarDigital Library
- R. Chen, J. Shi, B. Zang, and H. Guan. Bipartite-oriented distributed graph partitioning for big learning. In APSys, 2014. Google ScholarDigital Library
- R. Cheng, J. Hong, A. Kyrola, Y. Miao, X. Weng, M. Wu, F. Yang, L. Zhou, F. Zhao, and E. Chen. Kineograph: taking the pulse of a fast-changing and connected world. In EuroSys, pages 85–98, 2012. Google ScholarDigital Library
- T. H. Cormen, C. E. Leiserson, R. L. Rivest, C. Stein, et al. Introduction to algorithms, volume 2. MIT press Cambridge, 2001. Google ScholarDigital Library
- M. Dashti, A. Fedorova, J. Funston, F. Gaud, R. Lachaize, B. Lepers, V. Quema, and M. Roth. Traffic management: a holistic approach to memory placement on numa systems. In ASPLOS, 2013. Google ScholarDigital Library
- T. David, R. Guerraoui, and V. Trigonakis. Everything you always wanted to know about synchronization but were afraid to ask. In SOSP, 2013. Google ScholarDigital Library
- D. Dice, V. J. Marathe, and N. Shavit. Lock cohorting: a general technique for designing numa locks. In PPoPP, pages 247–256, 2012. Google ScholarDigital Library
- M. Faloutsos, P. Faloutsos, and C. Faloutsos. On power-law relationships of the internet topology. In SIGCOMM, pages 251–262, 1999. Google ScholarDigital Library
- F. Gaud, B. Lepers, J. Decouchant, J. Funston, A. Fedorova, V. Quema, and I. Grenoble. Large pages may be harmful on numa systems. In USENIX ATC, 2014. Google ScholarDigital Library
- J. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: Distributed graph-parallel computation on natural graphs. In OSDI, 2012. Google ScholarDigital Library
- J. E. Gonzalez, R. S. Xin, A. Dave, D. Crankshaw, M. J. Franklin, and I. Stoica. Graphx: Graph processing in a distributed dataflow framework. In OSDI, 2014. Google ScholarDigital Library
- W. Han, Y. Miao, K. Li, M. Wu, F. Yang, L. Zhou, V. Prabhakaran, W. Chen, and E. Chen. Chronos: a graph engine for temporal graph analysis. In EuroSys, 2014. Google ScholarDigital Library
- U. Kang, D. Horng, et al. Inference of beliefs on billion-scale graphs. In SIGKDD-LDMTA, 2010.Google Scholar
- Z. Khayyat, K. Awara, A. Alonazi, H. Jamjoom, D. Williams, and P. Kalnis. Mizan: a system for dynamic load balancing in large-scale graph processing. In EuroSys, pages 169–182, 2013. Google ScholarDigital Library
- H. Kwak, C. Lee, H. Park, and S. Moon. What is twitter, a social network or a news media? In WWW, pages 591–600, 2010. Google ScholarDigital Library
- A. Kyrola, G. Blelloch, and C. Guestrin. GraphChi: Large-scale graph computation on just a PC. In OSDI, 2012. Google ScholarDigital Library
- R. Lachaize, B. Lepers, V. Quéma, et al. Memprof: A memory profiler for numa multicore systems. In USENIX ATC, 2012. Google ScholarDigital Library
- X. Liu and J. Mellor-Crummey. A tool to analyze the performance of multithreaded programs on numa architectures. In PPoPP, 2014. Google ScholarDigital Library
- Y. Liu, B. Wu, H. Wang, and P. Ma. Bpgm: A big graph mining tool. Tsinghua Science and Technology, 19(1):33–38, 2014.Google ScholarCross Ref
- Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein. Distributed GraphLab: a framework for machine learning and data mining in the cloud. VLDB Endowment, 5(8):716–727, 2012. Google ScholarDigital Library
- A. Lumsdaine, D. Gregor, B. Hendrickson, and J. Berry. Challenges in parallel graph processing. Parallel Processing Letters, 17(1):5–20, 2007.Google ScholarCross Ref
- Z. Majo and T. R. Gross. Memory management in numa multicore systems: trapped between cache contention and interconnect overhead. In ISMM, 2011. Google ScholarDigital Library
- G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In SIGMOD, 2010. Google ScholarDigital Library
- J. Mellor-Crummey and M. L. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM TOCS, 9(1):21–65, 1991. Google ScholarDigital Library
- U. Meyer and P. Sanders. δ -stepping: a parallelizable shortest path algorithm. Journal of Algorithms, 49(1):114–152, 2003. Google ScholarDigital Library
- D. G. Murray, F. McSherry, R. Isaacs, M. Isard, P. Barham, and M. Abadi. Naiad: a timely dataflow system. In SOSP, pages 439–455. ACM, 2013. Google ScholarDigital Library
- D. Nguyen, A. Lenharth, and K. Pingali. A lightweight infrastructure for graph analytics. In SOSP, 2013. Google ScholarDigital Library
- B. Panda, J. Herbach, S. Basu, and R. Bayardo. PLANET: massively parallel learning of tree ensembles with MapReduce. In VLDB, 2009. Google ScholarDigital Library
- V. Prabhakaran, M. Wu, X. Weng, F. McSherry, L. Zhou, and M. Haridasan. Managing large graphs on multi-cores with graph awareness. In Usenix ATC, 2012. Google ScholarDigital Library
- A. Roy, I. Mihailovic, and W. Zwaenepoel. X-stream: Edge-centric graph processing using streaming partitions. In SOSP, 2013. Google ScholarDigital Library
- J. Shun and G. E. Blelloch. Ligra: a lightweight graph processing framework for shared memory. In PPoPP, 2013. Google ScholarDigital Library
- A. Smola and S. Narayanamurthy. An architecture for parallel topic models. In VLDB, 2010. Google ScholarDigital Library
- P. Wang, K. Zhang, R. Chen, H. Chen, and H. Guan. Replication-based fault-tolerance for large-scale graph processing. In DSN, 2014. Google ScholarDigital Library
- C. Xie, R. Chen, H. Guan, B. Zang, and H. Chen. Sync or async: Time to fuse for distributed graph-parallel computation. In PPoPP, 2015. Google ScholarDigital Library
- X. Zhao, A. Chang, A. D. Sarma, H. Zheng, and B. Y. Zhao. On the embeddability of random walk distances. In VLDB, 2013. Google ScholarDigital Library
- J. Zhong and B. He. Medusa: Simplified Graph Processing on GPUs. TPDS, 2013. Google ScholarDigital Library
- X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical report, Technical Report CMUCALD-02-107, Carnegie Mellon University, 2002.Google Scholar
Index Terms
- NUMA-aware graph-structured analytics
Recommendations
NUMA-aware graph-structured analytics
PPoPP 2015: Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel ProgrammingGraph-structured analytics has been widely adopted in a number of big data applications such as social computation, web-search and recommendation systems. Though much prior research focuses on scaling graph-analytics on distributed environments, the ...
Node-based memory management for scalable NUMA architectures
ROSS '12: Proceedings of the 2nd International Workshop on Runtime and Operating Systems for SupercomputersLarge state-of-the-art NUMA systems may offer more than two levels of node distances. The result is a hierarchical architecture with significant differences in memory access bandwidth and latency. Consequently, NUMA-aware memory management and the ...
Understanding the performance of storage class memory file systems in the NUMA architecture
Recent developments in storage class memory (SCM) such as PCM, MRAM, resistive RAM (RRAM), and spin-transfer torque (STT)-RAM have strengthened their leadership as storage media for memory-based file systems. Traditional Linux memory-based file systems ...
Comments