ABSTRACT
Nowadays, Online Social Networks (OSNs) have become dramatically popular and the study of social graphs attracts the interests of a large number of researchers. One critical challenge is the huge size of the social graph, which makes the graph analyzing or even the data crawling incredibly time consuming, and sometimes impossible to be completed. Thus, graph sampling algorithms have been introduced to obtain a smaller subgraph which reflects the properties of the original graph well. Breadth-First Sampling (BFS) is widely used in graph sampling, but it is biased towards high-degree vertices during the process of sampling. Besides, Metropolis-Hasting Random Walk (MHRW), which is proposed to get unbiased samples of the social graph, requires the graph to be well connected. In this paper, we propose a vertex sampling algorithm, so-called Albatross Sampling (AS), which introduces random jump strategy into MHRW during the sampling process. The embedded random jump makes the sampling procedure more flexible and avoids being trapped in some locally well connected part. According to our evaluation, we find that no matter using tightly or loosely connected graphs, AS performs significantly better than MHRW and BFS. On the one hand, AS estimates the degree distribution with much lower Normalized Mean Square Error (NMSE) by consuming the same resource budget. On the other hand, to get an acceptable estimation of the degree distribution, AS requires much less resource budget.
- Project of Sampling Social Graphs. http://code.google.com/p/sampling-social-graphs/, 2011.Google Scholar
- Y.-Y. Ahn, S. Han, H. Kwak, S. Moon, and H. Jeong. Analysis of Topological Characteristics of Huge Online Social Networking Services. In Proc. of WWW, 2007. Google ScholarDigital Library
- ASU. Social Computing Data Repository at Arizona State University. http://socialcomputing.asu.edu/pages/home.Google Scholar
- F. Benevenuto, T. Rodrigues, M. Cha, and V. Almeida. Characterizing User Behavior in Online Social Networks. In Proc. of ACM IMC, 2009. Google ScholarDigital Library
- R. Diestel. Graph Theory. Springer-Verlag, 2010.Google Scholar
- N. Eagle, A. S. Pentland, and D. Lazer. Inferring friendship network structure by using mobile phone data. PNAS, 106(36):15274--15278, August 2009.Google ScholarCross Ref
- Geek.com. Twitter reaches 200 million users and 110 million tweets per day. http://www.geek.com/articles/news/twitter-reaches-200-million-users-and-110-million-tweets-per-day-20110120/.Google Scholar
- M. Gjoka. Measurement of online social networks. UC Irvine PhD Thesis, 2010. Google ScholarDigital Library
- M. Gjoka, M. Kurant, C. T. Butts, and A. Markopoulou. Walking in Facebook: A Case Study of Unbiased Sampling of Osns. In Proc. of IEEE INFOCOM, 2010. Google ScholarDigital Library
- M. Gjoka, M. Kurant, C. T. Butts, and A. Markopoulou. Practical recommendations on sampling osn users by crawling the social graph. JSAC special issue on Measurement of Internet Topologies, 2011.Google Scholar
- H. Kwak, C. Lee, H. Park, and S. Moon. What is Twitter, a Social Network or a News Media? In Proc. of WWW, 2010. Google ScholarDigital Library
- J. Leskovec and C. Faloutsos. Sampling from Large Graphs. In Proc. of ACM SIGKDD, 2006. Google ScholarDigital Library
- D. A. Levin, Y. Peres, and E. L. Wilmer. Markov Chains and Mixing Times. American Mathematical Society, 2008.Google ScholarCross Ref
- L. Lovasz. Random walks on graphs: a survey. Combinatorics, 1993.Google Scholar
- N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculation by fast computing machines. J. Chem. Physics, 106(21), August 1953.Google Scholar
- A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and B. Bhattacharjee. Measurement and Analysis of Online Social Networks. In Proc. of ACM IMC, volume 106, October 2007. Google ScholarDigital Library
- M. Newman. Assortative mixing in networks. Phys. Rev. Lett., 89, 2002.Google Scholar
- B. Ribeiro, W. Gauvin, B. Liu, and D. Towsley. On Myspace account spans and double Pareto-like distribution of friends. UMass Amherst Technical Report, (UM-CS-2010-001), January 2010.Google Scholar
- B. Ribeiro and D. Towsley. Estimating and Sampling Graphs with Multidimensional Random Walks. In Proc. of ACM IMC, November 2010. Google ScholarDigital Library
- Stanford. Stanford large network dataset collection. http://snap.stanford.edu/data/index.html/.Google Scholar
- T. Wang, Y. Chen, Z. Zhang, T. Xu, L. Jin, P. Hui, B. Deng, and X. Li. Understanding graph sampling algorithms for social network analysis. In the 3rd ICDCS Workshop on Simplifying Complex Networks for Practitioners, June 2011. Google ScholarDigital Library
- D. J. Watts and S. Strogatz. Collective dynamics of 'small-world' networks. Nature, 393(6684), June 1998.Google Scholar
- C. Wilson, B. Boe, A. Sala, K. P. N. Puttaswamy, and B. Y. Zhao. User Interactions in Social Networks and their Implications. In Proc. of ACM EuroSys, 2009. Google ScholarDigital Library
Index Terms
- Albatross sampling: robust and effective hybrid vertex sampling for social graphs
Recommendations
Unbiased sampling in directed social graph
SIGCOMM '10: Proceedings of the ACM SIGCOMM 2010 conferenceMicroblogging services, such as Twitter, are among the most important online social networks(OSNs). Different from OSNs such as Facebook, the topology of microblogging service is a directed graph instead of an undirected graph. Recently, due to the ...
Sampling in online social networks
SAC '14: Proceedings of the 29th Annual ACM Symposium on Applied ComputingIn this paper, we propose a new graph sampling method for online social networks that achieves the following. First, a sample graph should reflect the ratio between the number of nodes and the number of edges of the original graph. Second, a sample ...
Walking on a graph with a magnifying glass: stratified sampling via weighted random walks
SIGMETRICS '11: Proceedings of the ACM SIGMETRICS joint international conference on Measurement and modeling of computer systemsOur objective is to sample the node set of a large unknown graph via crawling, to accurately estimate a given metric of interest. We design a random walk on an appropriately defined weighted graph that achieves high efficiency by preferentially crawling ...
Comments