ABSTRACT
Counting the frequency of small subgraphs is a fundamental technique in network analysis across various domains, most notably in bioinformatics and social networks. The special case of triangle counting has received much attention. Getting results for 4-vertex patterns is highly challenging, and there are few practical results known that can scale to massive sizes. Indeed, even a highly tuned enumeration code takes more than a day on a graph with millions of edges. Most previous work that runs for truly massive graphs employ clusters and massive parallelization.
We provide a sampling algorithm that provably and accurately approximates the frequencies of all 4-vertex pattern subgraphs. Our algorithm is based on a novel technique of 3-path sampling and a special pruning scheme to decrease the variance in estimates. We provide theoretical proofs for the accuracy of our algorithm, and give formal bounds for the error and confidence of our estimates. We perform a detailed empirical study and show that our algorithm provides estimates within 1% relative error for all subpatterns (over a large class of test graphs), while being orders of magnitude faster than enumeration and other sampling based algorithms. Our algorithm takes less than a minute (on a single commodity machine) to process an Orkut social network with 300 million edges.
- N. Alon, R. Yuster, and U. Zwick. Color-coding: A new method for finding simple paths, cycles and other small subgraphs within large graphs. In Proceedings of the Twenty-sixth Annual ACM Symposium on Theory of Computing, STOC '94, pages 326--335, New York, NY, USA, 1994. ACM. Google ScholarDigital Library
- L. Becchetti, P. Boldi, C. Castillo, and A. Gionis. Efficient semi-streaming algorithms for local triangle counting in massive graphs. In KDD'08, pages 16--24, 2008. Google ScholarDigital Library
- J. W. Berry, L. K. Fostvedt, D. J. Nordman, C. A. Phillips, C. Seshadhri, and A. G. Wilson. Why do simple algorithms for triangle enumeration work in the real world? In Proceedings of the 5th Conference on Innovations in Theoretical Computer Science, ITCS '14, pages 225--234, New York, NY, USA, 2014. ACM. Google ScholarDigital Library
- N. Betzler, R. van Bevern, M. R. Fellows, C. Komusiewicz, and R. Niedermeier. Parameterized algorithmics for finding connected motifs in biological networks. IEEE/ACM Trans. Comput. Biology Bioinform., 8(5):1296--1308, 2011. Google ScholarDigital Library
- M. Bhuiyan, M. Rahman, M. Rahman, and M. A. Hasan. Guise: Uniform sampling of graphlets for large graph analysis. In Proceedings of International Conference on Data Mining, pages 91--100, 2012. Google ScholarDigital Library
- M. A. Bhuiyan and M. A. Hasan. Mirage: An iterative mapreduce based frequent subgraph mining algorithm. Technical report, arXiv, 2013. http://arxiv.org/pdf/1307.5894.pdf.Google Scholar
- R. Burt. Structural holes and good ideas. American Journal of Sociology, 110(2):349--399, 2004.Google ScholarCross Ref
- H. Chernoff. A measure of asymptotic efficiency for test of a hypothesis based on the sum of observations. Annals of mathematical statistics, 23:493--507, 1952.Google Scholar
- N. Chiba and T. Nishizeki. Arboricity and subgraph listing algorithms. SIAM J. Comput., 14:210--223, 1985. Google ScholarDigital Library
- J. Coleman. Social capital in the creation of human capital. American Journal of Sociology, 94:S95--S120, 1988.Google ScholarCross Ref
- D. Dubhashi and A. Panconesi. Concentration of Measure for the Analysis of Randomized Algorithms. Cambridge University Press, 2009. Google ScholarDigital Library
- N. Durak, A. Pinar, T. G. Kolda, and C. Seshadhri. Degree relations of triangles in real-world networks and graph models. In CIKM'12, 2012. Google ScholarDigital Library
- G. Fagiolo. Clustering in complex directed networks. Phys. Rev. E, 76:026107, Aug 2007.Google ScholarCross Ref
- K. Faust. A puzzle concerning triads in social networks: Graph constraints and the triad census. Social Networks, 32(3):221--233, 2010.Google ScholarCross Ref
- M. Gonen and Y. Shavitt. Approximating the number of network motifs. Internet Mathematics, 6(3):349--372, 2009.Google ScholarCross Ref
- D. Hales and S. Arteconi. Motifs in evolving cooperative networks look like protein structure networks. NHM, 3(2):239--249, 2008.Google ScholarCross Ref
- W. Hoeffding. Probability inequalities for sums of bounded random variables. J. American Statistical Association, 58:13--30, 1963.Google ScholarCross Ref
- P. Holland and S. Leinhardt. A method for detecting structure in sociometric data. American Journal of Sociology, 76:492--513, 1970.Google ScholarCross Ref
- F. Hormozdiari, P. Berenbrink, N. Pryulj, and S. C. Sahinalp. Not all scale-free networks are born equal: The role of the seed graph in ppi network evolution. PLoS Computational Biology, 118, 2007.Google ScholarCross Ref
- S. Itzkovitz, R. Levitt, N. Kashtan, R. Milo, M. Itzkovitz, and U. Alon. Coarse-graining and self-dissimilarity of complex networks. 71(016127), January 2005.Google Scholar
- T. G. Kolda, A. Pinar, T. Plantenga, C. Seshadhri, and C. Task. Counting triangles in massive graphs with MapReduce. SIAM Journal of Scientific Computing, 2013. To appear.Google Scholar
- D. Marcus and Y. Shavitt. Efficient counting of network motifs. In ICDCS Workshops, pages 92--98. IEEE Computer Society, 2010. Google ScholarDigital Library
- R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon. Network motifs: Simple building blocks of complex networks. Science, 298(5594):824--827, 2002.Google ScholarCross Ref
- T. Plantenga. Inexact subgraph isomorphism in mapreduce. Journal of Parallel and Distributed Computing, (0), 2012. Google ScholarDigital Library
- A. Portes. Social capital: Its origins and applications in modern sociology. Annual Review of Sociology, 24(1):1--24, 1998.Google ScholarCross Ref
- N. Przulj, D. G. Corneil, and I. Jurisica. Modeling interactome: scale-free or geometric?. Bioinformatics, 20(18):3508--3515, 2004. Google ScholarDigital Library
- M. Rahman, M. A. Bhuiyan, and M. A. Hasan. Graft: An efficient graphlet counting method for large graph analysis. IEEE Transactions on Knowledge and Data Engineering, PP(99), 2014.Google ScholarCross Ref
- T. Schank and D. Wagner. Approximating clustering coefficient and transitivity. Journal of Graph Algorithms and Applications, 9:265--275, 2005.Google ScholarCross Ref
- T. Schank and D. Wagner. Finding, counting and listing all triangles in large graphs, an experimental study. In Experimental and Efficient Algorithms, pages 606--609. Springer Berlin / Heidelberg, 2005. Google ScholarDigital Library
- C. Seshadhri, T. G. Kolda, and A. Pinar. Community structure and scale-free collections of Erdos-Renyi graphs. Physical Review E, 85(5):056109, May 2012.Google ScholarCross Ref
- C. Seshadhri, A. Pinar, and T. G. Kolda. Fast triangle counting through wedge sampling. In Proceedings of the SIAM Conference on Data Mining, 2013.Google Scholar
- S. Son, A. Kang, H. Kim, T. Kwon, J. Park, and H. Kim. Analysis of context dependence in social interaction networks of a massively multiplayer online role-playing game. PLoS ONE, 7(4):e33918, 04 2012.Google ScholarCross Ref
- M. Szell and S. Thurner. Measuring social dynamics in a massive multiplayer online game. Social Networks, 32:313--329, 2010.Google ScholarCross Ref
- C. Tsourakakis, P. Drineas, E. Michelakis, I. Koutis, and C. Faloutsos. Spectral counting of triangles in power-law networks via element-wise sparsification. In ASONAM'09, pages 66--71, 2009. Google ScholarDigital Library
- C. Tsourakakis, M. N. Kolountzakis, and G. Miller. Triangle sparsifiers. J. Graph Algorithms and Applications, 15:703--726, 2011.Google ScholarCross Ref
- C. E. Tsourakakis, U. Kang, G. L. Miller, and C. Faloutsos. Doulion: counting triangles in massive graphs with a coin. In Knowledge Data and Discovery (KDD), pages 837--846, 2009. Google ScholarDigital Library
- J. Ugander, L. Backstrom, and J. M. Kleinberg. Subgraph frequencies: mapping the empirical and extremal geography of large graph collections. In D. Schwabe, V. A. F. Almeida, H. Glaser, R. A. Baeza-Yates, and S. B. Moon, editors, WWW, pages 1307--1318. International World Wide Web Conferences Steering Committee / ACM, 2013. Google ScholarDigital Library
- A. Walker. Walker, a. j. (1977). "an efficient method for generating discrete random variables with general distributions. ACM Transactions on Mathematical Software (TOMS), 3(3):253--256, 1977. Google ScholarDigital Library
- D. Watts and S. Strogatz. Collective dynamics of 'small-world' networks. Nature, 393:440--442, 1998.Google ScholarCross Ref
- B. Welles, A. Van Devender, and N. Contractor. Is a friend a friend?: Investigating the structure of friendship networks in virtual worlds. In CHI-EA'10, pages 4027--4032, 2010. Google ScholarDigital Library
- E. Wong, B. Baur, S. Quader, and C.-H. Huang. Biological network motif detection: principles and practice. Briefings in Bioinformatics, 13(2):202--215, 2012.Google ScholarCross Ref
- Z. Zhao, G. Wang, A. Butt, M. Khan, V. S. A. Kumar, and M. Marathe. Sahad: Subgraph analysis in massive networks using hadoop. In Proceedings of International Parallel and Distributed Processing Symposium (IPDPS), pages 390--401, 2012. Google ScholarDigital Library
- Stanford Network Analysis Project (SNAP). Available at http://snap.stanford.edu/.Google Scholar
Index Terms
- Path Sampling: A Fast and Provable Method for Estimating 4-Vertex Subgraph Counts
Recommendations
ESCAPE: Efficiently Counting All 5-Vertex Subgraphs
WWW '17: Proceedings of the 26th International Conference on World Wide WebCounting the frequency of small subgraphs is a fundamental technique in network analysis across various domains, most notably in bioinformatics and social networks. The special case of triangle counting has received much attention. Getting results for 4-...
Scalable Subgraph Counting: The Methods Behind The Madness
WWW '19: Companion Proceedings of The 2019 World Wide Web ConferenceSubgraph counting is a fundamental problem in graph analysis that finds use in a wide array of applications. The basic problem is to count or approximate the occurrences of a small subgraph (the pattern) in a large graph (the dataset). Subgraph counting ...
A Note on Non-reconstructible 3-Hypergraphs
Non-reconstructible 3-hypergraphs are studied. A number of counting lemmas are proved for the subgraphs and sub-hypergraphs of a 3-hypergraph. A computer search is used to find all non-reconstructible 3-hypergraphs on at most 8 vertices and 11 triples. ...
Comments