skip to main content
10.1145/775047.775059acmconferencesArticle/Chapter ViewAbstractPublication PageskddConference Proceedingsconference-collections
Article

ANF: a fast and scalable tool for data mining in massive graphs

Published:23 July 2002Publication History

ABSTRACT

Graphs are an increasingly important data source, with such important graphs as the Internet and the Web. Other familiar graphs include CAD circuits, phone records, gene sequences, city streets, social networks and academic citations. Any kind of relationship, such as actors appearing in movies, can be represented as a graph. This work presents a data mining tool, called ANF, that can quickly answer a number of interesting questions on graph-represented data, such as the following. How robust is the Internet to failures? What are the most influential database papers? Are there gender differences in movie appearance patterns? At its core, ANF is based on a fast and memory-efficient approach for approximating the complete "neighbourhood function" for a graph. For the Internet graph (268K nodes), ANF's highly-accurate approximation is more than 700 times faster than the exact computation. This reduces the running time from nearly a day to a matter of a minute or two, allowing users to perform ad hoc drill-down tasks and to repeatedly answer questions about changing data sources. To enable this drill-down, ANF employs new techniques for approximating neighbourhood-type functions for graphs with distinguished nodes and/or edges. When compared to the best existing approximation, ANF's approach is both faster and more accurate, given the same resources. Additionally, unlike previous approaches, ANF scales gracefully to handle disk resident graphs. Finally, we present some of our results from mining large graphs using ANF.

References

  1. L. A. Adamic. The small world Web. In Proceedings of the European Conf. on Digital Libraries, 1999.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1--7):107--117, 1998.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. A. Broder, R. Kumar, F. Maghoul, P. Raghavan, and R. Stata. Graph structure in the Web. In Proceedings of the 9th International World Wide Web Conference, pages 247--256, 2000.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. E. Cohen. Size-estimation framework with applications to transitive closure and reachability. Journal of Computer and System Sciences, 55(3):441--453, December 1997.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Cook and Holder. Graph-based data mining. ISTA: Intelligent Systems & their applications, 15, 2000.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. CORA search engine. http://www.cora.whizbang.com.]]Google ScholarGoogle Scholar
  7. P. Domingos and M. Richardson. Mining the network value of customers. In KDD-2001, pages 57--66, 2001.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. M. Faloutsos, P. Faloutsos, and C. Faloutsos. On power-law relationships of the internet topology. In SIGCOMM, 1999.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31:182--209, 1985.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. IMDB. http://www.imdb.com.]]Google ScholarGoogle Scholar
  11. A. Inokuchi, T. Washio, and H. Motoda. An apriori-based algorithm for mining frequent substructures from graph data. In PDKK, pages 13--23, 2000.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. http://cs.bell-labs.com/who/ches/map/.]]Google ScholarGoogle Scholar
  13. S. R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal. The Web as a graph. In ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 1--10, 2000.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. R. J. Lipton and J. F. Naughton. Estimating the size of generalized transitive closures. In Proceedings of 15th International Conference on Very Large Data Bases, pages 315--326, 1989.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. M. H. Nodine, M. T. Goodrich, and J. S. Vitter. Blocking for external graph searching. In Proc. ACM PODS Conference (PODS-93), pages 222--232, 1993.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. C. R. Palmer, G. Siganos, M. Faloutsos, C. Faloutsos, and P. Gibbons. The connectivity and fault-tolerance of the Internet topology. In Workshop on Network-Related Data Management (NRDM-2001), 2001.]]Google ScholarGoogle Scholar
  17. C. R. Palmer and J. G. Steffan. Generating network toplogies that obey power laws. In IEEE Globecom 2000, 2000.]]Google ScholarGoogle Scholar
  18. G. Salton and M. MeGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. http://www.isi.edu/scan/mercator/maps.html.]]Google ScholarGoogle Scholar
  20. S. L. Tauro, C. Palmer, G. Siganos, and M. Faloutsos. A simple conceptual model for the Internet topology. In IEEE Globecom 2001, 2001.]]Google ScholarGoogle Scholar

Index Terms

  1. ANF: a fast and scalable tool for data mining in massive graphs

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in
          • Published in

            cover image ACM Conferences
            KDD '02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining
            July 2002
            719 pages
            ISBN:158113567X
            DOI:10.1145/775047

            Copyright © 2002 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 23 July 2002

            Permissions

            Request permissions about this article.

            Request Permissions

            Check for updates

            Qualifiers

            • Article

            Acceptance Rates

            KDD '02 Paper Acceptance Rate44of307submissions,14%Overall Acceptance Rate1,133of8,635submissions,13%

            Upcoming Conference

            KDD '24

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader