skip to main content
research-article

NUMA-aware graph-structured analytics

Published:24 January 2015Publication History
Skip Abstract Section

Abstract

Graph-structured analytics has been widely adopted in a number of big data applications such as social computation, web-search and recommendation systems. Though much prior research focuses on scaling graph-analytics on distributed environments, the strong desire on performance per core, dollar and joule has generated considerable interests of processing large-scale graphs on a single server-class machine, which may have several terabytes of RAM and 80 or more cores. However, prior graph-analytics systems are largely neutral to NUMA characteristics and thus have suboptimal performance. This paper presents a detailed study of NUMA characteristics and their impact on the efficiency of graph-analytics. Our study uncovers two insights: 1) either random or interleaved allocation of graph data will significantly hamper data locality and parallelism; 2) sequential inter-node (i.e., remote) memory accesses have much higher bandwidth than both intra- and inter-node random ones. Based on them, this paper describes Polymer, a NUMA-aware graph-analytics system on multicore with two key design decisions. First, Polymer differentially allocates and places topology data, application-defined data and mutable runtime states of a graph system according to their access patterns to minimize remote accesses. Second, for some remaining random accesses, Polymer carefully converts random remote accesses into sequential remote accesses, by using lightweight replication of vertices across NUMA nodes. To improve load balance and vertex convergence, Polymer is further built with a hierarchical barrier to boost parallelism and locality, an edge-oriented balanced partitioning for skewed graphs, and adaptive data structures according to the proportion of active vertices. A detailed evaluation on an 80-core machine shows that Polymer often outperforms the state-of-the-art single-machine graph-analytics systems, including Ligra, X-Stream and Galois, for a set of popular real-world and synthetic graphs.

References

  1. The 9th dimacs implementation challenge - shortest paths. http://www.dis.uniroma1.it/challenge9/.Google ScholarGoogle Scholar
  2. Graph 500. http://www.graph500.org.Google ScholarGoogle Scholar
  3. numactl. http://oss.sgi.com/projects/libnuma/.Google ScholarGoogle Scholar
  4. L. A. Adamic and B. A. Huberman. Zipf’s law and the internet. Glottometrics, 3(1):143–150, 2002.Google ScholarGoogle Scholar
  5. A. Baumann, P. Barham, P.-E. Dagand, T. Harris, R. Isaacs, S. Peter, T. Roscoe, A. Schüpbach, and A. Singhania. The multikernel: a new os architecture for scalable multicore systems. In SOSP, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. S. Beamer, K. Asanovi´c, and D. Patterson. Direction-optimizing breadth-first search. Scientific Programming, 21(3):137–148, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In WWW, pages 107–117, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. I. Calciu, D. Dice, Y. Lev, V. Luchangco, V. J. Marathe, and N. Shavit. Numa-aware reader-writer locks. In PPoPP, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. D. Chakrabarti, Y. Zhan, and C. Faloutsos. R-mat: A recursive model for graph mining. In SDM, volume 4, pages 442–446. SIAM, 2004.Google ScholarGoogle Scholar
  10. R. Chen and H. Chen. Tiled-mapreduce: Efficient and flexible mapreduce processing on multicore with tiling. ACM TACO, 10, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. R. Chen, H. Chen, and B. Zang. Tiled-mapreduce: optimizing resource usages of data-parallel applications on multicore with tiling. In PACT, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. R. Chen, J. Shi, Y. Chen, H. Guan, and H. Chen. Powerlyra: Differentiated graph computation and partitioning on skewed graphs. Technical Report 2013-11-001, IPADS, SJTU, 2013.Google ScholarGoogle Scholar
  13. R. Chen, X. Ding, P. Wang, H. Chen, B. Zang, and H. Guan. Computation and communication efficient graph processing with distributed immutable view. In HPDC, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. R. Chen, J. Shi, B. Zang, and H. Guan. Bipartite-oriented distributed graph partitioning for big learning. In APSys, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. R. Cheng, J. Hong, A. Kyrola, Y. Miao, X. Weng, M. Wu, F. Yang, L. Zhou, F. Zhao, and E. Chen. Kineograph: taking the pulse of a fast-changing and connected world. In EuroSys, pages 85–98, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. T. H. Cormen, C. E. Leiserson, R. L. Rivest, C. Stein, et al. Introduction to algorithms, volume 2. MIT press Cambridge, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. M. Dashti, A. Fedorova, J. Funston, F. Gaud, R. Lachaize, B. Lepers, V. Quema, and M. Roth. Traffic management: a holistic approach to memory placement on numa systems. In ASPLOS, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. T. David, R. Guerraoui, and V. Trigonakis. Everything you always wanted to know about synchronization but were afraid to ask. In SOSP, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. D. Dice, V. J. Marathe, and N. Shavit. Lock cohorting: a general technique for designing numa locks. In PPoPP, pages 247–256, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. M. Faloutsos, P. Faloutsos, and C. Faloutsos. On power-law relationships of the internet topology. In SIGCOMM, pages 251–262, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. F. Gaud, B. Lepers, J. Decouchant, J. Funston, A. Fedorova, V. Quema, and I. Grenoble. Large pages may be harmful on numa systems. In USENIX ATC, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. J. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: Distributed graph-parallel computation on natural graphs. In OSDI, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. J. E. Gonzalez, R. S. Xin, A. Dave, D. Crankshaw, M. J. Franklin, and I. Stoica. Graphx: Graph processing in a distributed dataflow framework. In OSDI, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. W. Han, Y. Miao, K. Li, M. Wu, F. Yang, L. Zhou, V. Prabhakaran, W. Chen, and E. Chen. Chronos: a graph engine for temporal graph analysis. In EuroSys, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. U. Kang, D. Horng, et al. Inference of beliefs on billion-scale graphs. In SIGKDD-LDMTA, 2010.Google ScholarGoogle Scholar
  26. Z. Khayyat, K. Awara, A. Alonazi, H. Jamjoom, D. Williams, and P. Kalnis. Mizan: a system for dynamic load balancing in large-scale graph processing. In EuroSys, pages 169–182, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. H. Kwak, C. Lee, H. Park, and S. Moon. What is twitter, a social network or a news media? In WWW, pages 591–600, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. A. Kyrola, G. Blelloch, and C. Guestrin. GraphChi: Large-scale graph computation on just a PC. In OSDI, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. R. Lachaize, B. Lepers, V. Quéma, et al. Memprof: A memory profiler for numa multicore systems. In USENIX ATC, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. X. Liu and J. Mellor-Crummey. A tool to analyze the performance of multithreaded programs on numa architectures. In PPoPP, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Y. Liu, B. Wu, H. Wang, and P. Ma. Bpgm: A big graph mining tool. Tsinghua Science and Technology, 19(1):33–38, 2014.Google ScholarGoogle ScholarCross RefCross Ref
  32. Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein. Distributed GraphLab: a framework for machine learning and data mining in the cloud. VLDB Endowment, 5(8):716–727, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. A. Lumsdaine, D. Gregor, B. Hendrickson, and J. Berry. Challenges in parallel graph processing. Parallel Processing Letters, 17(1):5–20, 2007.Google ScholarGoogle ScholarCross RefCross Ref
  34. Z. Majo and T. R. Gross. Memory management in numa multicore systems: trapped between cache contention and interconnect overhead. In ISMM, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In SIGMOD, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. J. Mellor-Crummey and M. L. Scott. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM TOCS, 9(1):21–65, 1991. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. U. Meyer and P. Sanders. δ -stepping: a parallelizable shortest path algorithm. Journal of Algorithms, 49(1):114–152, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. D. G. Murray, F. McSherry, R. Isaacs, M. Isard, P. Barham, and M. Abadi. Naiad: a timely dataflow system. In SOSP, pages 439–455. ACM, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. D. Nguyen, A. Lenharth, and K. Pingali. A lightweight infrastructure for graph analytics. In SOSP, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. B. Panda, J. Herbach, S. Basu, and R. Bayardo. PLANET: massively parallel learning of tree ensembles with MapReduce. In VLDB, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. V. Prabhakaran, M. Wu, X. Weng, F. McSherry, L. Zhou, and M. Haridasan. Managing large graphs on multi-cores with graph awareness. In Usenix ATC, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. A. Roy, I. Mihailovic, and W. Zwaenepoel. X-stream: Edge-centric graph processing using streaming partitions. In SOSP, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. J. Shun and G. E. Blelloch. Ligra: a lightweight graph processing framework for shared memory. In PPoPP, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. A. Smola and S. Narayanamurthy. An architecture for parallel topic models. In VLDB, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. P. Wang, K. Zhang, R. Chen, H. Chen, and H. Guan. Replication-based fault-tolerance for large-scale graph processing. In DSN, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. C. Xie, R. Chen, H. Guan, B. Zang, and H. Chen. Sync or async: Time to fuse for distributed graph-parallel computation. In PPoPP, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. X. Zhao, A. Chang, A. D. Sarma, H. Zheng, and B. Y. Zhao. On the embeddability of random walk distances. In VLDB, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. J. Zhong and B. He. Medusa: Simplified Graph Processing on GPUs. TPDS, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical report, Technical Report CMUCALD-02-107, Carnegie Mellon University, 2002.Google ScholarGoogle Scholar

Index Terms

  1. NUMA-aware graph-structured analytics

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in

      Full Access

      • Published in

        cover image ACM SIGPLAN Notices
        ACM SIGPLAN Notices  Volume 50, Issue 8
        PPoPP '15
        August 2015
        290 pages
        ISSN:0362-1340
        EISSN:1558-1160
        DOI:10.1145/2858788
        • Editor:
        • Andy Gill
        Issue’s Table of Contents
        • cover image ACM Conferences
          PPoPP 2015: Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
          January 2015
          290 pages
          ISBN:9781450332057
          DOI:10.1145/2688500

        Copyright © 2015 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 24 January 2015

        Check for updates

        Qualifiers

        • research-article

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader