skip to main content
10.1145/2736277.2741101acmotherconferencesArticle/Chapter ViewAbstractPublication PageswwwConference Proceedingsconference-collections
research-article

Path Sampling: A Fast and Provable Method for Estimating 4-Vertex Subgraph Counts

Published:18 May 2015Publication History

ABSTRACT

Counting the frequency of small subgraphs is a fundamental technique in network analysis across various domains, most notably in bioinformatics and social networks. The special case of triangle counting has received much attention. Getting results for 4-vertex patterns is highly challenging, and there are few practical results known that can scale to massive sizes. Indeed, even a highly tuned enumeration code takes more than a day on a graph with millions of edges. Most previous work that runs for truly massive graphs employ clusters and massive parallelization.

We provide a sampling algorithm that provably and accurately approximates the frequencies of all 4-vertex pattern subgraphs. Our algorithm is based on a novel technique of 3-path sampling and a special pruning scheme to decrease the variance in estimates. We provide theoretical proofs for the accuracy of our algorithm, and give formal bounds for the error and confidence of our estimates. We perform a detailed empirical study and show that our algorithm provides estimates within 1% relative error for all subpatterns (over a large class of test graphs), while being orders of magnitude faster than enumeration and other sampling based algorithms. Our algorithm takes less than a minute (on a single commodity machine) to process an Orkut social network with 300 million edges.

References

  1. N. Alon, R. Yuster, and U. Zwick. Color-coding: A new method for finding simple paths, cycles and other small subgraphs within large graphs. In Proceedings of the Twenty-sixth Annual ACM Symposium on Theory of Computing, STOC '94, pages 326--335, New York, NY, USA, 1994. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. L. Becchetti, P. Boldi, C. Castillo, and A. Gionis. Efficient semi-streaming algorithms for local triangle counting in massive graphs. In KDD'08, pages 16--24, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. J. W. Berry, L. K. Fostvedt, D. J. Nordman, C. A. Phillips, C. Seshadhri, and A. G. Wilson. Why do simple algorithms for triangle enumeration work in the real world? In Proceedings of the 5th Conference on Innovations in Theoretical Computer Science, ITCS '14, pages 225--234, New York, NY, USA, 2014. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. N. Betzler, R. van Bevern, M. R. Fellows, C. Komusiewicz, and R. Niedermeier. Parameterized algorithmics for finding connected motifs in biological networks. IEEE/ACM Trans. Comput. Biology Bioinform., 8(5):1296--1308, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. M. Bhuiyan, M. Rahman, M. Rahman, and M. A. Hasan. Guise: Uniform sampling of graphlets for large graph analysis. In Proceedings of International Conference on Data Mining, pages 91--100, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. M. A. Bhuiyan and M. A. Hasan. Mirage: An iterative mapreduce based frequent subgraph mining algorithm. Technical report, arXiv, 2013. http://arxiv.org/pdf/1307.5894.pdf.Google ScholarGoogle Scholar
  7. R. Burt. Structural holes and good ideas. American Journal of Sociology, 110(2):349--399, 2004.Google ScholarGoogle ScholarCross RefCross Ref
  8. H. Chernoff. A measure of asymptotic efficiency for test of a hypothesis based on the sum of observations. Annals of mathematical statistics, 23:493--507, 1952.Google ScholarGoogle Scholar
  9. N. Chiba and T. Nishizeki. Arboricity and subgraph listing algorithms. SIAM J. Comput., 14:210--223, 1985. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. J. Coleman. Social capital in the creation of human capital. American Journal of Sociology, 94:S95--S120, 1988.Google ScholarGoogle ScholarCross RefCross Ref
  11. D. Dubhashi and A. Panconesi. Concentration of Measure for the Analysis of Randomized Algorithms. Cambridge University Press, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. N. Durak, A. Pinar, T. G. Kolda, and C. Seshadhri. Degree relations of triangles in real-world networks and graph models. In CIKM'12, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. G. Fagiolo. Clustering in complex directed networks. Phys. Rev. E, 76:026107, Aug 2007.Google ScholarGoogle ScholarCross RefCross Ref
  14. K. Faust. A puzzle concerning triads in social networks: Graph constraints and the triad census. Social Networks, 32(3):221--233, 2010.Google ScholarGoogle ScholarCross RefCross Ref
  15. M. Gonen and Y. Shavitt. Approximating the number of network motifs. Internet Mathematics, 6(3):349--372, 2009.Google ScholarGoogle ScholarCross RefCross Ref
  16. D. Hales and S. Arteconi. Motifs in evolving cooperative networks look like protein structure networks. NHM, 3(2):239--249, 2008.Google ScholarGoogle ScholarCross RefCross Ref
  17. W. Hoeffding. Probability inequalities for sums of bounded random variables. J. American Statistical Association, 58:13--30, 1963.Google ScholarGoogle ScholarCross RefCross Ref
  18. P. Holland and S. Leinhardt. A method for detecting structure in sociometric data. American Journal of Sociology, 76:492--513, 1970.Google ScholarGoogle ScholarCross RefCross Ref
  19. F. Hormozdiari, P. Berenbrink, N. Pryulj, and S. C. Sahinalp. Not all scale-free networks are born equal: The role of the seed graph in ppi network evolution. PLoS Computational Biology, 118, 2007.Google ScholarGoogle ScholarCross RefCross Ref
  20. S. Itzkovitz, R. Levitt, N. Kashtan, R. Milo, M. Itzkovitz, and U. Alon. Coarse-graining and self-dissimilarity of complex networks. 71(016127), January 2005.Google ScholarGoogle Scholar
  21. T. G. Kolda, A. Pinar, T. Plantenga, C. Seshadhri, and C. Task. Counting triangles in massive graphs with MapReduce. SIAM Journal of Scientific Computing, 2013. To appear.Google ScholarGoogle Scholar
  22. D. Marcus and Y. Shavitt. Efficient counting of network motifs. In ICDCS Workshops, pages 92--98. IEEE Computer Society, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon. Network motifs: Simple building blocks of complex networks. Science, 298(5594):824--827, 2002.Google ScholarGoogle ScholarCross RefCross Ref
  24. T. Plantenga. Inexact subgraph isomorphism in mapreduce. Journal of Parallel and Distributed Computing, (0), 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. A. Portes. Social capital: Its origins and applications in modern sociology. Annual Review of Sociology, 24(1):1--24, 1998.Google ScholarGoogle ScholarCross RefCross Ref
  26. N. Przulj, D. G. Corneil, and I. Jurisica. Modeling interactome: scale-free or geometric?. Bioinformatics, 20(18):3508--3515, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. M. Rahman, M. A. Bhuiyan, and M. A. Hasan. Graft: An efficient graphlet counting method for large graph analysis. IEEE Transactions on Knowledge and Data Engineering, PP(99), 2014.Google ScholarGoogle ScholarCross RefCross Ref
  28. T. Schank and D. Wagner. Approximating clustering coefficient and transitivity. Journal of Graph Algorithms and Applications, 9:265--275, 2005.Google ScholarGoogle ScholarCross RefCross Ref
  29. T. Schank and D. Wagner. Finding, counting and listing all triangles in large graphs, an experimental study. In Experimental and Efficient Algorithms, pages 606--609. Springer Berlin / Heidelberg, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. C. Seshadhri, T. G. Kolda, and A. Pinar. Community structure and scale-free collections of Erdos-Renyi graphs. Physical Review E, 85(5):056109, May 2012.Google ScholarGoogle ScholarCross RefCross Ref
  31. C. Seshadhri, A. Pinar, and T. G. Kolda. Fast triangle counting through wedge sampling. In Proceedings of the SIAM Conference on Data Mining, 2013.Google ScholarGoogle Scholar
  32. S. Son, A. Kang, H. Kim, T. Kwon, J. Park, and H. Kim. Analysis of context dependence in social interaction networks of a massively multiplayer online role-playing game. PLoS ONE, 7(4):e33918, 04 2012.Google ScholarGoogle ScholarCross RefCross Ref
  33. M. Szell and S. Thurner. Measuring social dynamics in a massive multiplayer online game. Social Networks, 32:313--329, 2010.Google ScholarGoogle ScholarCross RefCross Ref
  34. C. Tsourakakis, P. Drineas, E. Michelakis, I. Koutis, and C. Faloutsos. Spectral counting of triangles in power-law networks via element-wise sparsification. In ASONAM'09, pages 66--71, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. C. Tsourakakis, M. N. Kolountzakis, and G. Miller. Triangle sparsifiers. J. Graph Algorithms and Applications, 15:703--726, 2011.Google ScholarGoogle ScholarCross RefCross Ref
  36. C. E. Tsourakakis, U. Kang, G. L. Miller, and C. Faloutsos. Doulion: counting triangles in massive graphs with a coin. In Knowledge Data and Discovery (KDD), pages 837--846, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. J. Ugander, L. Backstrom, and J. M. Kleinberg. Subgraph frequencies: mapping the empirical and extremal geography of large graph collections. In D. Schwabe, V. A. F. Almeida, H. Glaser, R. A. Baeza-Yates, and S. B. Moon, editors, WWW, pages 1307--1318. International World Wide Web Conferences Steering Committee / ACM, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. A. Walker. Walker, a. j. (1977). "an efficient method for generating discrete random variables with general distributions. ACM Transactions on Mathematical Software (TOMS), 3(3):253--256, 1977. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. D. Watts and S. Strogatz. Collective dynamics of 'small-world' networks. Nature, 393:440--442, 1998.Google ScholarGoogle ScholarCross RefCross Ref
  40. B. Welles, A. Van Devender, and N. Contractor. Is a friend a friend?: Investigating the structure of friendship networks in virtual worlds. In CHI-EA'10, pages 4027--4032, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. E. Wong, B. Baur, S. Quader, and C.-H. Huang. Biological network motif detection: principles and practice. Briefings in Bioinformatics, 13(2):202--215, 2012.Google ScholarGoogle ScholarCross RefCross Ref
  42. Z. Zhao, G. Wang, A. Butt, M. Khan, V. S. A. Kumar, and M. Marathe. Sahad: Subgraph analysis in massive networks using hadoop. In Proceedings of International Parallel and Distributed Processing Symposium (IPDPS), pages 390--401, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. Stanford Network Analysis Project (SNAP). Available at http://snap.stanford.edu/.Google ScholarGoogle Scholar

Index Terms

  1. Path Sampling: A Fast and Provable Method for Estimating 4-Vertex Subgraph Counts

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Other conferences
      WWW '15: Proceedings of the 24th International Conference on World Wide Web
      May 2015
      1460 pages
      ISBN:9781450334693

      Copyright © 2015 Copyright is held by the International World Wide Web Conference Committee (IW3C2)

      Publisher

      International World Wide Web Conferences Steering Committee

      Republic and Canton of Geneva, Switzerland

      Publication History

      • Published: 18 May 2015

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article

      Acceptance Rates

      WWW '15 Paper Acceptance Rate131of929submissions,14%Overall Acceptance Rate1,899of8,196submissions,23%

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader