Towards Scale-out Capability on Social Graphs

Authors:
Haichuan Shang

National Institute of Information and Communications Technology, Japan and University of Tokyo, Japan, Tokyo, Japan

National Institute of Information and Communications Technology, Japan and University of Tokyo, Japan, Tokyo, Japan
View Profile

,
Xiang Zhao

National University of Defense Technology, China, Changsha, China

National University of Defense Technology, China, Changsha, China
View Profile

,
Uday Kiran

National Institute of Information and Communications Technology, Japan and University of Tokyo, Japan, Tokyo, Japan

National Institute of Information and Communications Technology, Japan and University of Tokyo, Japan, Tokyo, Japan
View Profile

,
Masaru Kitsuregawa

National Institute of Informatics, Japan and University of Tokyo, Japan, Tokyo, Japan

National Institute of Informatics, Japan and University of Tokyo, Japan, Tokyo, Japan
View Profile

CIKM '15: Proceedings of the 24th ACM International on Conference on Information and Knowledge ManagementOctober 2015Pages 253–262https://doi.org/10.1145/2806416.2806420

Published:17 October 2015Publication History

CIKM '15: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management

Pages 253–262

ABSTRACT

The development of cloud storage and computing has facilitated the rise of various big data applications. As a representative high performance computing (HPC) workload, graph processing is becoming a part of cloud computing. However, scalable computing on large graphs is still dominated by HPC solutions, which require high performance all-to-all collective operations over torus (or mesh) networking. Implementing those torus-based algorithms on commodity clusters, e.g., cloud computing infrastructures, can result in great latency due to inefficient communication. Moreover, designing a highly scalable system for large social graphs, is far from being trivial, as intrinsic features of social graphs, e.g., degree skewness and lacking of locality, often profoundly limit the extent of parallelism.

To resolve the challenges, we explore the iceberg of developing a scalable system for processing large social graphs on commodity clusters. In particular, we focus on the scale-out capability of the system. We propose a novel separator-combiner based query processing engine which provides native load-balancing and very low communication overhead, such that increasinglylarger graphs can be simply addressed by adding more computing nodes to the cluster.The proposed system achieves remarkable scale-out capability in processing large social graphs with skew degree distributions, while providing many critical features for big data analytics, such as easy-to-use API, fault-tolerance and recovery. We implement the system as a portable and easily configurable library, and conduct comprehensive experimental studies to demonstrate its effectiveness and efficiency.

References

Apache giraph: http://giraph.apache.org/.Google Scholar
The graph 500 list,http://www.graph500.org/.Google Scholar
A. Z. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. L. Wiener. Graph structure in the web. Computer Networks 33(1--6):309--320, 2000. Google ScholarDigital Library
N. Bronson, Z. Amsden, G. Cabrera, P. Chakka, P. Dimov, H. Ding J. Ferris, A. Giardullo, S. Kulkarni, H. Li, M. Marchukov, D. Petrov, L. Puzar, Y. J. Song, and V. Venkataramani. Tao: Facebook's distributed data store for the social graph. In Presented as part of the 2013 USENIX ATC, pages 49--60, 2013. Google ScholarDigital Library
D. Chakrabarti, Y. Zhan, and C. Faloutsos. R-mat: A recursive model for graph mining. In SDM, pages 442--446, 2004.Google ScholarCross Ref
E. Chan, M. Heimlich, A. Purkayastha, and R. van de Geijn. Collective communication: Theory, practice, and experience: Research articles. Concurr. Comput.: Pract. Exper., 19(13):1749--1783, Sept. 2007. Google ScholarDigital Library
F. Checconi and F. Petrini. Massive data analytics: The graph 500 on ibm blue gene/q. IBM Journal of Research and Development, 57(1/2):10, 2013. Google ScholarDigital Library
M. Faloutsos, P. Faloutsos, and C. Faloutsos. On power-law relationships of the internet topology. In SIGCOMM, pages 251--262, 1999. Google ScholarDigital Library
J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. Powergraph: Distributed graph-parallel computation on natural graphs. In Proceedings of the 10th USENIX Conference on OSDI, pages 17--30, 2012. Google ScholarDigital Library
D. Gregor and A. Lumsdaine. The parallel bgl: A generic library for distributed graph computations. In Parallel Object-Oriented Scientific Computing, 2005.Google Scholar
T. Gunarathne, J. Qiu, and D. Gannon. Towards a collective layer in the big data stack. In CCGrid, pages 236--245, 2014.Google ScholarDigital Library
M. Han and K. Daudjee. Giraph unchained: Barrierless asynchronous parallel execution in pregel-like graph processing systems. PVLDB, 8(9):950--961, 2015. Google ScholarDigital Library
U. Kang, C. E. Tsourakakis, and C. Faloutsos. Pegasus: mining peta-scale graphs. Knowl. Inf. Syst., 27(2):303--325, 2011. Google ScholarDigital Library
K. Lee and L. Liu. Efficient data partitioning model for heterogeneous graphs in the cloud. In SC, page 46, 2013. Google ScholarDigital Library
J. Lin and C. Dyer. Data-Intensive Text Processing with MapReduce. Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers, 2010. Google ScholarDigital Library
Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. Graphlab: A new parallel framework for machine learning. In Conference on Uncertainty in Artificial Intelligence (UAI), 2010.Google ScholarDigital Library
Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. Distributed graphlab: A framework for machine learning in the cloud. CoRR, abs/1204.6078, 2012. Google ScholarDigital Library
G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In SIGMOD Conference, pages 135--146, 2010. Google ScholarDigital Library
C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig latin: a not-so-foreign language for data processing. In SIGMOD Conference, pages 1099--1110, 2008. Google ScholarDigital Library
J. Pjesivac-Grbovic, T. Angskun, G. Bosilca, G. E. Fagg, E. Gabriel, and J. Dongarra. Performance analysis of mpi collective operations. pages 127--143, 2007. Google ScholarDigital Library
L. Qin, J. Yu, L. Chang, H. Cheng, C. Zhang, and X. Lin. Scalable big graph processing in mapreduce. In SIGMOD Conference, pages 827--838, 2014. Google ScholarDigital Library
M. Redekopp, Y. Simmhan, and V. K. Prasanna. Optimizations and analysis of bsp graph processing models on public clouds. Parallel and Distributed Processing Symposium, International, 0:203--214, 2013. Google ScholarDigital Library
S. Redner. How popular is your paper? an empirical study of the citation distribution. European Physical Journal B, 4(2):131--134, Aug. 1998.Google ScholarCross Ref
S. Salihoglu and J. Widom. Gps: A graph processing system. In SSDBM, 2013. Google ScholarDigital Library
S. Salihoglu and J. Widom. Optimizing graph algorithms on pregel-like systems. In PVLDB, 2014. Google ScholarDigital Library
N. Satish, N. Sundaram, M. M. A. Patwary, J. Seo, J. Park, M. A. Hassaan, S. Sengupta, Z. Yin, and P. Dubey. Navigating the maze of graph analytics frameworks using massive graph datasets. In SIGMOD Conference, pages 979--990, 2014. Google ScholarDigital Library
S. Seo, E. J. Yoon, J. Kim, S. Jin, J.-S. Kim, and S. Maeng. Hama: An efficient matrix computation with the mapreduce framework. In CloudCom, pages 721--726, 2010. Google ScholarDigital Library
H. Shang and M. Kitsuregawa. Efficient breadth-first search on large graphs with skewed degree distributions. In Joint 2013 EDBT/ICDT Conferences, pages 311--322, 2013. Google ScholarDigital Library
B. Shao, H. Wang, and Y. Li. Trinity: A distributed graph engine on a memory cloud. In SIGMOD Conference, pages 505--516, 2013. Google ScholarDigital Library
S. Tasci and M. Demirbas. Giraphx: Parallel yet serializable large-scale graph processing. In Proceedings of the 19th International Conference on Parallel Processing, Euro-Par'13, pages 458--469, Berlin, Heidelberg, 2013. Springer-Verlag. Google ScholarDigital Library
A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Anthony, H. Liu, and R. Murthy. Hive - a petabyte scale data warehouse using hadoop. In ICDE, pages 996--1005, 2010.Google ScholarCross Ref
Y. Tian, A. Balmin, S. A. Corsten, S. Tatikonda, and J. McPherson. From "think like a vertex" to "think like a graph". PVLDB, 7(3):193--204, 2013. Google ScholarDigital Library
K. Ueno and T. Suzumura. 2d partitioning based graph search for the graph500 benchmark. In IPDPS Workshops, pages 1925--1931, 2012. Google ScholarDigital Library
L. G. Valiant. A bridging model for parallel computation. Commun. ACM, 33(8):103--111, Aug. 1990. Google ScholarDigital Library
T. White. Hadoop - The Definitive Guide: Storage and Analysis at Internet Scale. O'Reilly, 2012. Google ScholarDigital Library
R. S. Xin, J. E. Gonzalez, M. J. Franklin, and I. Stoica. Graphx: a resilient distributed graph system on spark. In GRADES, page 2, 2013. Google ScholarDigital Library
A. Yoo, E. Chow, K. W. Henderson, W. M. III, B. Hendrickson, and U. V. "Catalyurek. A scalable distributed parallel breadth-first search algorithm on bluegene"l. In SC, page 25, 2005. Google ScholarDigital Library
M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster computing with working sets. In Proceedings of the 2Nd USENIX Conference on HotCloud, pages 10--10, 2010 Google ScholarDigital Library

Index Terms

Towards Scale-out Capability on Social Graphs
1. Information systems
  1. Data management systems
    1. Database management system engines

Recommendations

k-tuple domination in graphs

In a graph G, a vertex is said to dominate itself and all of its neighbors. For a fixed positive integer k, the k-tuple domination problem is to find a minimum sized vertex subset in a graph such that every vertex in the graph is dominated by at least k ...
Read More
Towards scalable distributed graph database engine for hybrid clouds
DataCloud '14: Proceedings of the 5th International Workshop on Data-Intensive Computing in the Clouds

Large graph data management and mining in clouds has become an important issue in recent times. We propose Acacia which is a distributed graph database engine for scalable handling of such large graph data. Acacia operates between the boundaries of ...
Read More
The clique operator on circular-arc graphs

A circular-arc graphG is the intersection graph of a collection of arcs on the circle and such a collection is called a model of G. Say that the model is proper when no arc of the collection contains another one, it is Helly when the arcs satisfy the ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
CIKM '15: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management
October 2015
1998 pages
ISBN:9781450337946
DOI:10.1145/2806416
General Chairs:
James Bailey
The University of Melbourne
,
Alistair Moffat
The University of Melbourne
,
Program Chairs:
Charu C. Aggarwal
IBM
,
Maarten de Rijke
University of Amsterdam
,
Ravi Kumar
Google
,
Vanessa Murdock
Microsoft
,
Timos Sellis
RMIT University
,
Jeffrey Xu Yu
Chinese University of Hong Kong
Copyright © 2015 Owner/Author
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 17 October 2015
Check for updates
Author Tags
algorithms
distributed databases
experimentation
graph databases
performance
social networks
Qualifiers
- research-article
Conference

Acceptance Rates
CIKM '15 Paper Acceptance Rate165of646submissions,26%Overall Acceptance Rate1,861of8,427submissions,22%
More
Upcoming Conference
CIKM '24

Sponsor:

sigir

sigir

The 33rd ACM International Conference on Information and Knowledge Management

October 21 - 25, 2024

Boise , ID , USA
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 0
  Total Citations
  View Citations
- 437
  Total Downloads
- Downloads (Last 12 months)16
- Downloads (Last 6 weeks)6
Other Metrics
View Author Metrics
Cited By
This publication has not been cited yet

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Towards Scale-out Capability on Social Graphs

CIKM '15: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management

ABSTRACT

References

Cited By

Index Terms

Recommendations

k-tuple domination in graphs

Towards scalable distributed graph database engine for hybrid clouds

The clique operator on circular-arc graphs