skip to main content
10.1145/1807167.1807184acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

Pregel: a system for large-scale graph processing

Published:06 June 2010Publication History

ABSTRACT

Many practical computing problems concern large graphs. Standard examples include the Web graph and various social networks. The scale of these graphs - in some cases billions of vertices, trillions of edges - poses challenges to their efficient processing. In this paper we present a computational model suitable for this task. Programs are expressed as a sequence of iterations, in each of which a vertex can receive messages sent in the previous iteration, send messages to other vertices, and modify its own state and that of its outgoing edges or mutate graph topology. This vertex-centric approach is flexible enough to express a broad set of algorithms. The model has been designed for efficient, scalable and fault-tolerant implementation on clusters of thousands of commodity computers, and its implied synchronicity makes reasoning about programs easier. Distribution-related details are hidden behind an abstract API. The result is a framework for processing large graphs that is expressive and easy to program.

References

  1. Thomas Anderson, Susan Owicki, James Saxe, and Charles Thacker, High-Speed Switch Scheduling for Local-Area Networks. ACM Trans. Comp. Syst. 11(4), 1993, 319--352. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. David A. Bader and Kamesh Madduri, Designing multithreaded algorithms for breadth-first search and st-connectivity on the Cray MTA-2, in Proc. 35th Intl. Conf. on Parallel Processing (ICPP'06), Columbus, OH, August 2006, 523--530. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Luiz Barroso, Jeffrey Dean, and Urs Hoelzle, Web search for a planet: The Google Cluster Architecture. IEEE Micro 23(2), 2003, 22--28. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Mohsen Bayati, Devavrat Shah, and Mayank Sharma, Maximum Weight Matching via Max-Product Belief Propagation. in Proc. IEEE Intl. Symp. on Information Theory, 2005, 1763--1767.Google ScholarGoogle Scholar
  5. Richard Bellman, On a routing problem. Quarterly of Applied Mathematics 16(1), 1958, 87--90.Google ScholarGoogle ScholarCross RefCross Ref
  6. Olaf Bonorden, Ben H. H. Juurlink, Ingo von Otte, and Ingo Rieping, The Paderborn University BSP (PUB) Library. Parallel Computing 29(2), 2003, 187--207. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Sergey Brin and Lawrence Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine. in Proc. 7th Intl. Conf. on the World Wide Web, 1998, 107--117. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Albert Chan and Frank Dehne, CGMGRAPH/CGMLIB: Implementing and Testing CGM Graph Algorithms on PC Clusters and Shared Memory Machines. Intl. J. of High Performance Computing Applications 19(1), 2005, 81--97.Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber, Bigtable: A Distributed Storage System for Structured Data. ACM Trans. Comp. Syst. 26(2), Art. 4, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Boris V. Cherkassky, Andrew V. Goldberg, and Tomasz Radzik, Shortest paths algorithms: Theory and experimental evaluation. Mathematical Programming 73, 1996, 129--174. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Jonathan Cohen, Graph Twiddling in a MapReduce World. Comp. in Science & Engineering, July/August 2009, 29--41. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Joseph R. Crobak, Jonathan W. Berry, Kamesh Madduri, and David A. Bader, Advanced Shortest Paths Algorithms on a Massively-Multithreaded Architecture. in Proc. First Workshop on Multithreaded Architectures and Applications, 2007, 1--8.Google ScholarGoogle Scholar
  13. John T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps. Future Generation Computer Systems 22, 2006, 303--312. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters. in Proc. 6th USENIX Symp. on Operating Syst. Design and Impl., 2004, 137--150. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Edsger W. Dijkstra, A Note on Two Problems in Connexion with Graphs. Numerische Mathematik 1, 1959, 269--271.Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Martin Erwig, Inductive Graphs and Functional Graph Algorithms. J. Functional Programming 1(5), 2001, 467--492. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Lester R. Ford, L. R. and Delbert R. Fulkerson, Flows in Networks. Princeton University Press, 1962.Google ScholarGoogle Scholar
  18. Ian Foster and Carl Kesselman (Eds), The Grid 2: Blueprint for a New Computing Infrastructure (2nd edition). Morgan Kaufmann, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, The Google File System. in Proc. 19th ACM Symp. on Operating Syst. Principles, 2003, 29--43. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Michael T. Goodrich and Roberto Tamassia, Data Structures and Algorithms in JAVA. (second edition). John Wiley and Sons, Inc., 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Mark W. Goudreau, Kevin Lang, Satish B. Rao, Torsten Suel, and Thanasis Tsantilas, Portable and Efficient Parallel Computing Using the BSP Model. IEEE Trans. Comp. 48(7), 1999, 670--689. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Douglas Gregor and Andrew Lumsdaine, The Parallel BGL: A Generic Library for Distributed Graph Computations. Proc. of Parallel Object-Oriented Scientific Computing (POOSC), July 2005.Google ScholarGoogle Scholar
  23. Douglas Gregor and Andrew Lumsdaine, Lifting Sequential Graph Algorithms for Distributed-Memory Parallel Computation. in Proc. 2005 ACM SIGPLAN Conf. on Object-Oriented Prog., Syst., Lang., and Applications (OOPSLA'05), October 2005, 423--437. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Jonathan L. Gross and Jay Yellen, Graph Theory and Its Applications. (2nd Edition). Chapman and Hall/CRC, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Aric A. Hagberg, Daniel A. Schult, and Pieter J. Swart, Exploring network structure, dynamics, and function using NetworkX. in Proc. 7th Python in Science Conf., 2008, 11--15.Google ScholarGoogle Scholar
  26. Jonathan Hill, Bill McColl, Dan Stefanescu, Mark Goudreau, Kevin Lang, Satish Rao, Torsten Suel, Thanasis Tsantilas, and Rob Bisseling, BSPlib: The BSP Programming Library. Parallel Computing 24, 1998, 1947--1980. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly, Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. in Proc. European Conf. on Computer Syst., 2007, 59--72. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Paris C. Kanellakis and Alexander A. Shvartsman, Fault-Tolerant Parallel Computation. Kluwer Academic Publishers, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Donald E. Knuth, Stanford GraphBase: A Platform for Combinatorial Computing. ACM Press, 1994. Google ScholarGoogle Scholar
  30. U. Kung, Charalampos E. Tsourakakis, and Christos Faloutsos, Pegasus: A Peta-Scale Graph Mining System - Implementation and Observations. Proc. Intl. Conf. Data Mining, 2009, 229--238. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Andrew Lumsdaine, Douglas Gregor, Bruce Hendrickson, and Jonathan W. Berry, Challenges in Parallel Graph Processing. Parallel Processing Letters 17, 2007, 5--20.Google ScholarGoogle ScholarCross RefCross Ref
  32. Kamesh Madduri, David A. Bader, Jonathan W. Berry, and Joseph R. Crobak, Parallel Shortest Path Algorithms for Solving Large-Scale Graph Instances. DIMACS Implementation Challenge - The Shortest Path Problem, 2006.Google ScholarGoogle Scholar
  33. Kamesh Madduri, David Ediger, Karl Jiang, David A. Bader, and Daniel Chavarria-Miranda, A Faster Parallel Algorithm and Efficient Multithreaded Implementation for Evaluating Betweenness Centrality on Massive Datasets, in Proc. 3rd Workshop on Multithreaded Architectures and Applications (MTAAP'09), Rome, Italy, May 2009.Google ScholarGoogle Scholar
  34. Grzegorz Malewicz, A Work-Optimal Deterministic Algorithm for the Certified Write-All Problem with a Nontrivial Number of Asynchronous Processors. SIAM J. Comput. 34(4), 2005, 993--1024. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Kurt Mehlhorn and Stefan Näher, The LEDA Platform of Combinatorial and Geometric Computing. Cambridge University Press, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Ulrich Meyer and Vitaly Osipov, Design and Implementation of a Practical I/O-efficient Shortest Paths Algorithm. in Proc. 3rd Workshop on Multithreaded Architectures and Applications (MTAAP'09), Rome, Italy, May 2009.Google ScholarGoogle ScholarCross RefCross Ref
  37. Ulrich Meyer and Peter Sanders, Δ-stepping: A Parallelizable Shortest Path Algorithm. J. Algorithms 49(1), 2003, 114--152. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Richard Miller, A Library for Bulk-Synchronous Parallel Programming. in Proc. British Computer Society Parallel Processing Specialist Group Workshop on General Purpose Parallel Computing, 1993.Google ScholarGoogle Scholar
  39. Kameshwar Munagala and Abhiram Ranade, I/O-complexity of graph algorithms. in Proc. 10th Annual ACM-SIAM Symp. on Discrete Algorithms, 1999, 687--694. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins, Pig Latin: A Not-So-Foreign Language for Data Processing. in Proc. ACM SIGMOD Intl. Conf. on Management of Data, 2008, 1099--1110. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Rob Pike, Sean Dorward, Robert Griesemer, and Sean Quinlan, Interpreting the Data: Parallel Analysis with Sawzall. Scientific Programming Journal 13(4), Special Issue on Grids and Worldwide Computing Programming Models and Infrastructure, 2005, 227--298. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Protocol Buffers-Google's data interchange format. http://code.google.com/p/protobuf/ 2009.Google ScholarGoogle Scholar
  43. Jeremy G. Siek, Lie-Quan Lee, and Andrew Lumsdaine, The Boost Graph Library: User Guide and Reference Manual. Addison Wesley, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. Mikkel Thorup, Undirected Single-Source Shortest Paths with Positive Integer Weights in Linear Time. J. ACM 46(3), May 1999, 362--394. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. Leslie G. Valiant, A Bridging Model for Parallel Computation. Comm. ACM 33(8), 1990, 103--111. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. Andy Yoo, Edmond Chow, Keith Henderson, William McLendon, Bruce Hendrickson, and Umit Catalyurek, A Scalable Distributed Parallel Breadth-First Search Algorithm on BlueGene/L, in Proc. 2005 ACM/IEEE Conf. on Supercomputing (SC'05), 2005, 25--43. Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Ulfar Erlingsson, Pradeep Kumar Gunda, and Jon Currey, DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language. in Proc. 8th USENIX Symp. on Operating Syst. Design and Implementation, 2008, 10--14. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Pregel: a system for large-scale graph processing

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in
        • Published in

          cover image ACM Conferences
          SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
          June 2010
          1286 pages
          ISBN:9781450300322
          DOI:10.1145/1807167

          Copyright © 2010 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 6 June 2010

          Permissions

          Request permissions about this article.

          Request Permissions

          Check for updates

          Qualifiers

          • research-article

          Acceptance Rates

          Overall Acceptance Rate785of4,003submissions,20%

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader