ABSTRACT
The development of cloud storage and computing has facilitated the rise of various big data applications. As a representative high performance computing (HPC) workload, graph processing is becoming a part of cloud computing. However, scalable computing on large graphs is still dominated by HPC solutions, which require high performance all-to-all collective operations over torus (or mesh) networking. Implementing those torus-based algorithms on commodity clusters, e.g., cloud computing infrastructures, can result in great latency due to inefficient communication. Moreover, designing a highly scalable system for large social graphs, is far from being trivial, as intrinsic features of social graphs, e.g., degree skewness and lacking of locality, often profoundly limit the extent of parallelism.
To resolve the challenges, we explore the iceberg of developing a scalable system for processing large social graphs on commodity clusters. In particular, we focus on the scale-out capability of the system. We propose a novel separator-combiner based query processing engine which provides native load-balancing and very low communication overhead, such that increasinglylarger graphs can be simply addressed by adding more computing nodes to the cluster.The proposed system achieves remarkable scale-out capability in processing large social graphs with skew degree distributions, while providing many critical features for big data analytics, such as easy-to-use API, fault-tolerance and recovery. We implement the system as a portable and easily configurable library, and conduct comprehensive experimental studies to demonstrate its effectiveness and efficiency.
- Apache giraph: http://giraph.apache.org/.Google Scholar
- The graph 500 list,http://www.graph500.org/.Google Scholar
- A. Z. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. L. Wiener. Graph structure in the web. Computer Networks 33(1--6):309--320, 2000. Google ScholarDigital Library
- N. Bronson, Z. Amsden, G. Cabrera, P. Chakka, P. Dimov, H. Ding J. Ferris, A. Giardullo, S. Kulkarni, H. Li, M. Marchukov, D. Petrov, L. Puzar, Y. J. Song, and V. Venkataramani. Tao: Facebook's distributed data store for the social graph. In Presented as part of the 2013 USENIX ATC, pages 49--60, 2013. Google ScholarDigital Library
- D. Chakrabarti, Y. Zhan, and C. Faloutsos. R-mat: A recursive model for graph mining. In SDM, pages 442--446, 2004.Google ScholarCross Ref
- E. Chan, M. Heimlich, A. Purkayastha, and R. van de Geijn. Collective communication: Theory, practice, and experience: Research articles. Concurr. Comput.: Pract. Exper., 19(13):1749--1783, Sept. 2007. Google ScholarDigital Library
- F. Checconi and F. Petrini. Massive data analytics: The graph 500 on ibm blue gene/q. IBM Journal of Research and Development, 57(1/2):10, 2013. Google ScholarDigital Library
- M. Faloutsos, P. Faloutsos, and C. Faloutsos. On power-law relationships of the internet topology. In SIGCOMM, pages 251--262, 1999. Google ScholarDigital Library
- J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. Powergraph: Distributed graph-parallel computation on natural graphs. In Proceedings of the 10th USENIX Conference on OSDI, pages 17--30, 2012. Google ScholarDigital Library
- D. Gregor and A. Lumsdaine. The parallel bgl: A generic library for distributed graph computations. In Parallel Object-Oriented Scientific Computing, 2005.Google Scholar
- T. Gunarathne, J. Qiu, and D. Gannon. Towards a collective layer in the big data stack. In CCGrid, pages 236--245, 2014.Google ScholarDigital Library
- M. Han and K. Daudjee. Giraph unchained: Barrierless asynchronous parallel execution in pregel-like graph processing systems. PVLDB, 8(9):950--961, 2015. Google ScholarDigital Library
- U. Kang, C. E. Tsourakakis, and C. Faloutsos. Pegasus: mining peta-scale graphs. Knowl. Inf. Syst., 27(2):303--325, 2011. Google ScholarDigital Library
- K. Lee and L. Liu. Efficient data partitioning model for heterogeneous graphs in the cloud. In SC, page 46, 2013. Google ScholarDigital Library
- J. Lin and C. Dyer. Data-Intensive Text Processing with MapReduce. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers, 2010. Google ScholarDigital Library
- Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. Graphlab: A new parallel framework for machine learning. In Conference on Uncertainty in Artificial Intelligence (UAI), 2010.Google ScholarDigital Library
- Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. Distributed graphlab: A framework for machine learning in the cloud. CoRR, abs/1204.6078, 2012. Google ScholarDigital Library
- G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In SIGMOD Conference, pages 135--146, 2010. Google ScholarDigital Library
- C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig latin: a not-so-foreign language for data processing. In SIGMOD Conference, pages 1099--1110, 2008. Google ScholarDigital Library
- J. Pjesivac-Grbovic, T. Angskun, G. Bosilca, G. E. Fagg, E. Gabriel, and J. Dongarra. Performance analysis of mpi collective operations. pages 127--143, 2007. Google ScholarDigital Library
- L. Qin, J. Yu, L. Chang, H. Cheng, C. Zhang, and X. Lin. Scalable big graph processing in mapreduce. In SIGMOD Conference, pages 827--838, 2014. Google ScholarDigital Library
- M. Redekopp, Y. Simmhan, and V. K. Prasanna. Optimizations and analysis of bsp graph processing models on public clouds. Parallel and Distributed Processing Symposium, International, 0:203--214, 2013. Google ScholarDigital Library
- S. Redner. How popular is your paper? an empirical study of the citation distribution. European Physical Journal B, 4(2):131--134, Aug. 1998.Google ScholarCross Ref
- S. Salihoglu and J. Widom. Gps: A graph processing system. In SSDBM, 2013. Google ScholarDigital Library
- S. Salihoglu and J. Widom. Optimizing graph algorithms on pregel-like systems. In PVLDB, 2014. Google ScholarDigital Library
- N. Satish, N. Sundaram, M. M. A. Patwary, J. Seo, J. Park, M. A. Hassaan, S. Sengupta, Z. Yin, and P. Dubey. Navigating the maze of graph analytics frameworks using massive graph datasets. In SIGMOD Conference, pages 979--990, 2014. Google ScholarDigital Library
- S. Seo, E. J. Yoon, J. Kim, S. Jin, J.-S. Kim, and S. Maeng. Hama: An efficient matrix computation with the mapreduce framework. In CloudCom, pages 721--726, 2010. Google ScholarDigital Library
- H. Shang and M. Kitsuregawa. Efficient breadth-first search on large graphs with skewed degree distributions. In Joint 2013 EDBT/ICDT Conferences, pages 311--322, 2013. Google ScholarDigital Library
- B. Shao, H. Wang, and Y. Li. Trinity: A distributed graph engine on a memory cloud. In SIGMOD Conference, pages 505--516, 2013. Google ScholarDigital Library
- S. Tasci and M. Demirbas. Giraphx: Parallel yet serializable large-scale graph processing. In Proceedings of the 19th International Conference on Parallel Processing, Euro-Par'13, pages 458--469, Berlin, Heidelberg, 2013. Springer-Verlag. Google ScholarDigital Library
- A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Anthony, H. Liu, and R. Murthy. Hive - a petabyte scale data warehouse using hadoop. In ICDE, pages 996--1005, 2010.Google ScholarCross Ref
- Y. Tian, A. Balmin, S. A. Corsten, S. Tatikonda, and J. McPherson. From "think like a vertex" to "think like a graph". PVLDB, 7(3):193--204, 2013. Google ScholarDigital Library
- K. Ueno and T. Suzumura. 2d partitioning based graph search for the graph500 benchmark. In IPDPS Workshops, pages 1925--1931, 2012. Google ScholarDigital Library
- L. G. Valiant. A bridging model for parallel computation. Commun. ACM, 33(8):103--111, Aug. 1990. Google ScholarDigital Library
- T. White. Hadoop - The Definitive Guide: Storage and Analysis at Internet Scale. O'Reilly, 2012. Google ScholarDigital Library
- R. S. Xin, J. E. Gonzalez, M. J. Franklin, and I. Stoica. Graphx: a resilient distributed graph system on spark. In GRADES, page 2, 2013. Google ScholarDigital Library
- A. Yoo, E. Chow, K. W. Henderson, W. M. III, B. Hendrickson, and U. V. "Catalyurek. A scalable distributed parallel breadth-first search algorithm on bluegene"l. In SC, page 25, 2005. Google ScholarDigital Library
- M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster computing with working sets. In Proceedings of the 2Nd USENIX Conference on HotCloud, pages 10--10, 2010 Google ScholarDigital Library
Index Terms
- Towards Scale-out Capability on Social Graphs
Recommendations
k-tuple domination in graphs
In a graph G, a vertex is said to dominate itself and all of its neighbors. For a fixed positive integer k, the k-tuple domination problem is to find a minimum sized vertex subset in a graph such that every vertex in the graph is dominated by at least k ...
Towards scalable distributed graph database engine for hybrid clouds
DataCloud '14: Proceedings of the 5th International Workshop on Data-Intensive Computing in the CloudsLarge graph data management and mining in clouds has become an important issue in recent times. We propose Acacia which is a distributed graph database engine for scalable handling of such large graph data. Acacia operates between the boundaries of ...
The clique operator on circular-arc graphs
A circular-arc graphG is the intersection graph of a collection of arcs on the circle and such a collection is called a model of G. Say that the model is proper when no arc of the collection contains another one, it is Helly when the arcs satisfy the ...
Comments