skip to main content
research-article

Unsupervised Domain Ranking in Large-Scale Web Crawls

Published:27 September 2018Publication History
Skip Abstract Section

Abstract

With the proliferation of web spam and infinite autogenerated web content, large-scale web crawlers require low-complexity ranking methods to effectively budget their limited resources and allocate bandwidth to reputable sites. In this work, we assume crawls that produce frontiers orders of magnitude larger than RAM, where sorting of pending URLs is infeasible in real time. Under these constraints, the main objective is to quickly compute domain budgets and decide which of them can be massively crawled. Those ranked at the top of the list receive aggressive crawling allowances, while all other domains are visited at some small default rate. To shed light on Internet-wide spam avoidance, we study topology-based ranking algorithms on domain-level graphs from the two largest academic crawls: a 6.3B-page IRLbot dataset and a 1B-page ClueWeb09 exploration. We first propose a new methodology for comparing the various rankings and then show that in-degree BFS-based techniques decisively outperform classic PageRank-style methods, including TrustRank. However, since BFS requires several orders of magnitude higher overhead and is generally infeasible for real-time use, we propose a fast, accurate, and scalable estimation method called TSE that can achieve much better crawl prioritization in practice. It is especially beneficial in applications with limited hardware resources.

References

  1. Jacob Abernethy, Olivier Chapelle, and Carlos Castillo. 2008. Web spam identification through content and hyperlinks. In Workshop on Adversarial Information Retrieval on the Web. 41--44. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Serge Abiteboul, Mihai Preda, and Gregory Cobena. 2003. Adaptive on-line page importance computation. In International Conference on World Wide Web (WWW). 280--290. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Sarker Tanzir Ahmed and Dmitri Loguinov. 2015. Around the web in six weeks: Documenting a large-scale crawl. In IEEE Conference on Computer Communications (INFOCOM). 1598--1606.Google ScholarGoogle ScholarCross RefCross Ref
  4. alexa.com. Retrieved in 2015 from http://www.alexa.com/.Google ScholarGoogle Scholar
  5. Jesse Alpert and Nissan Hajaj. 2008. We Knew the Web Was Big... (July 2008). Retrieved from http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html.Google ScholarGoogle Scholar
  6. Shatlyk Ashyralyyev, B. Barla Cambazoglu, and Cevdet Aykanat. 2013. Incorporating the surfing behavior of web users into pagerank. In ACM International Conference on Information and Knowledge Management (CIKM). 2351--2356. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Ricardo Baeza-Yates, Carlos Castillo, Mauricio Marin, and Andrea Rodriguez. 2005. Crawling a country: Better strategies than breadth-first for web page ordering. In International Conference on World Wide Web (WWW). Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Xiao Bai, B. Barla Cambazoglu, and Flavio P. Junqueira. 2011. Discovering URLs through user feedback. In ACM International Conference on Information and Knowledge Management (CIKM). 77--86. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Luca Becchetti, Carlos Castillo, Debora Donato, Stefano Leonardi, and Ricardo Baeza-Yates. 2006. Link-based characterization and detection of web spam. In Workshop on Adversarial Information Retrieval on the Web.Google ScholarGoogle Scholar
  10. Luca Becchetti, Carlos Castillo, Debora Donato, Stefano Leonardi, and Ricardo Baeza-Yates. 2006. Using rank propagation and probabilistic counting for link-based spam detection. In Workshop on Web Mining and Web Usage Analysis (WebKDD).Google ScholarGoogle Scholar
  11. Andras Benczur, Karoly Csalogany, and Tamas Sarlos. 2006. Link-based similarity search to fight web spam. In Workshop on Adversarial Information Retrieval on the Web.Google ScholarGoogle Scholar
  12. Andras A. Benczur, Karoly Csalogany, Tamas Sarlos, and Mate Uher. 2005. Spamrank -- Fully automatic link spam detection. In Workshop on Adversarial Information Retrieval on the Web.Google ScholarGoogle Scholar
  13. Paolo Boldi, Bruno Codenotti, Massimo Santini, and Sebastiano Vigna. 2004. UbiCrawler: A scalable fully distributed web crawler. Software: Practice 8 Experience 34, 8 (July 2004), 711--726. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Sergey Brin and Lawrence Page. 1998. The anatomy of a large-scale hypertextual web search engine. In International Conference on World Wide Web (WWW). 107--117. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Carlos Castillo, Mauricio Marin, Andrea Rodriguez, and Ricardo Baeza-Yates. 2004. Scheduling algorithms for web crawling. In Latin American Web Conference. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. James Caverlee and Ling Liu. 2007. Countering web spam with credibility-based link analysis. In ACM Symposium on Principles of Distributed Computing (PODC). 157--166. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. James Caverlee, Steve Webb, and Ling Liu. 2007. Spam-resilient web rankings via influence throttling. In International Parallel and Distributed Processing Symposium (IPDPS).Google ScholarGoogle ScholarCross RefCross Ref
  18. Anne Chao and Shen-Ming Lee. 1992. Estimating the number of classes via sample coverage. Journal of the American Statistical Association 87, 417 (March 1992), 210--217.Google ScholarGoogle ScholarCross RefCross Ref
  19. Yen-Yu Chen, Qingqing Gan, and Torsten Suel. 2002. I/O-efficient techniques for computing pagerank. In International Conference on Information and Knowledge Management (CIKM). Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Yen-Yu Chen, Qingqing Gan, and Torsten Suel. 2004. Local methods for estimating pagerank values. In International Conference on Information and Knowledge Management (CIKM). Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Junghoo Cho, Hector Garcia-Molina, Taher Haveliwala, Wang Lam, Andreas Paepcke, and Sriram Raghavan Gary Wesley. 2006. Stanford webbase components and applications. ACM Transactions on Internet Technology 6, 2 (May 2006), 153--186. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Junghoo Cho, Hector Garcia-Molina, and Lawrence Page. 1998. Efficient crawling through URL ordering. In International Conference on World Wide Web (WWW). 161--172. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Young-Joo Chung, Masashi Toyoda, and Masaru Kitsuregawa. 2009. A study of link farm distribution and evolution using a time series of web snapshots. In Workshop on Adversarial Information Retrieval on the Web. 9--16. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. ClueWeb09 Dataset. Retrieved in 2015 from http://www.lemurproject.org/clueweb09/.Google ScholarGoogle Scholar
  25. CLuE cluster. Retrieved from http://www.nsf.gov/cise/clue/index.jsp.Google ScholarGoogle Scholar
  26. ClueWeb09 documentation. Retrieved in 2015 from http://boston.lti.cs.cmu.edu/Data/web08-bst/planning.html.Google ScholarGoogle Scholar
  27. Isabel Drost and Tobias Scheffer. 2005. Thwarting the nigritude ultramaring: Learning to identify link spam. In European Conference on Machine Learning (ECML). Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. N. Duffield, C. Lund, and M. Thorup. 2003. Estimating flow distributions from sampled flow statistics. In ACM SIGCOMM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications. 325--336. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Jenny Edwards, Kevin McCurley, and John Tomlin. 2001. An adaptive model for optimizing performance of an incremental web crawler. In International Conference on World Wide Web (WWW). 106--113. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. David Eichmann. 1994. The RBSE spider -- Balancing effective search against web load. In International Conference on World Wide Web (WWW).Google ScholarGoogle ScholarCross RefCross Ref
  31. Nadav Eiron, Kevin S. McCurley, and John A. Tomlin. 2004. Ranking the web frontier. In International Conference on World Wide Web (WWW). 309--318. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Guang Feng, Tie-Yan Liu, Ying Wang, Ying Bao, Zhiming Ma, Xu-Dong Zhang, and Wei-Ying Ma. 2006. AggregateRank: Bringing order to web sites. In ACM SIGIR Conference on Research and Development in Information Retrieval. 75--82. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Dennis Fetterly, Mark Manasse, and Marc Najork. 2004. Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages. In International Workshop on the Web and Databases (WebDB). 1--6. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Google Toolbar. Retrieved in 2008 from http://toolbar.google.com.Google ScholarGoogle Scholar
  35. IRLbot PLD graph. Retrieved in 2018 from http://irl.cs.tamu.edu/projects/web/.Google ScholarGoogle Scholar
  36. D. Gruhl, L. Chavet, D. Gibson, J. Meyer, P. Pattanayak, A. Tomkins, and J. Zien. 2004. How to build a webfountain: An architecture for very large-scale text analytics. IBM Systems Journal 43, 1 (2004), 64--77. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Zoltán Gyöngyi, Pavel Berkhin, Hector Garcia-Molina, and Jan Pedersen. 2006. Link spam detection based on mass estimation. In International Conference on Very Large Data Bases (VLDB). 439--450. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Zoltán Gyöngyi and Hector Garcia-Molina. 2005. Link spam alliances. In International Conference on Very Large Data Bases (VLDB). 517--528. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Zoltán Gyöngyi and Hector Garcia-Molina. 2005. Web spam taxonomy. In Workshop on Adversarial Information Retrieval on the Web. 39--47.Google ScholarGoogle Scholar
  40. Zoltán Gyöngyi, Hector Garcia-Molina, and Jan Pedersen. 2004. Combating web spam with trustRank. In International Conference on Very Large Data Bases (VLDB). 576--587. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Younes Hafri and Chabane Djeraba. 2004. High performance crawling system. In ACM International Workshop on Multimedia Information Retrieval (MIR). 299--306. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Allan Heydon and Marc Najork. 1999. Mercator: A scalable, extensible web crawler. World Wide Web 2, 4 (Dec. 1999), 219--229. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. Jun Hirai, Sriram Raghavan, Hector Garcia-Molina, and Andreas Paepcke. 2000. WebBase: A repository of web pages. In International Conference on World Wide Web (WWW). 277--293. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. Internet Archive. Retrieved in 2018 from http://archive.org/.Google ScholarGoogle Scholar
  45. Melody Ivory and Marti Hearst. 2002. Statistical profiles of highly-rated web sites. In Conference on Human Factors in Computing Systems (CHI). 367--374. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. Kasom Koht-arsa and Surasak Sanguanpong. 2002. High performance large scale web spider architecture. In IEEE International Symposium on Communications and Information Technologies (ISCIT).Google ScholarGoogle Scholar
  47. Laboratory for Web Algorithmics. Retrieved in 2018 from http://law.dsi.unimi.it/.Google ScholarGoogle Scholar
  48. Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, and Dmitri Loguinov. 2009. IRLbot: Scaling to 6 billion pages and beyond. ACM Transactions on the Web 3, 3 (June 2009), 1--34. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. Ronny Lempel and Shlomo Moran. 2001. SALSA: The stochastic approach for link-structure analysis. ACM Transactions on Information Systems 19, 2 (Apr. 2001), 131--160. Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. Yanhong Li. 1998. Toward a qualitative search engine. IEEE Internet Computing 2, 4 (July 1998), 24--29. Google ScholarGoogle ScholarDigital LibraryDigital Library
  51. Yuting Liu, Bin Gao, Tie-Yan Liu, Ying Zhang, Zhiming Ma, Shuyuan He, and Hang Li. 2008. BrowseRank: Letting web users vote for page importance. In ACM SIGIR Conference on Research and Development in Information Retrieval. 451--458. Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. Oliver A. McBryan. 1994. GENVL and WWWW: Tools for taming the web. In International Conference on World Wide Web (WWW).Google ScholarGoogle ScholarCross RefCross Ref
  53. Gilad Mishne, David Carmel, and Ronny Lempel. 2005. Blocking blog spam with language model disagreement. In Workshop on Adversarial Information Retrieval on the Web.Google ScholarGoogle Scholar
  54. Mozilla. 2013. Public Suffix List. (May 2013). Retrieved from http://publicsuffix.org/.Google ScholarGoogle Scholar
  55. Marc Najork and Allan Heydon. 2001. High-Performance Web Crawling. Technical Report 173. Compaq SRC. Retrieved from http://www.hpl.hp.com/techreports/Compaq-DEC/SRC-RR-173.pdf.Google ScholarGoogle Scholar
  56. Marc Najork and Janet L. Wiener. 2001. Breadth-first search crawling yields high-quality pages. In International Conference on World Wide Web (WWW). 114--118. Google ScholarGoogle ScholarDigital LibraryDigital Library
  57. Zaiqing Nie, Yuanzhi Zhang, Ji-Rong Wen, and Wei-Ying Ma. 2005. Object-level ranking: Bringing order to web objects. In International Conference on World Wide Web (WWW). Google ScholarGoogle ScholarDigital LibraryDigital Library
  58. Alexandros Ntoulas, Marc Najork, Mark Manasse, and Dennis Fetterly. 2006. Detecting spam web pages through content analysis. In International Conference on World Wide Web (WWW). 83--92. Google ScholarGoogle ScholarDigital LibraryDigital Library
  59. Nutch. Retrieved in 2018 from http://nutch.apache.org/.Google ScholarGoogle Scholar
  60. Jothi Padmanabhan. 2010. Introduction to Grid Computing via Map-Reduce 8 Hadoop. (Dec. 2010). Retrieved from http://internationalnetworking.iu.edu/sites/internationalnetworking.iu.edu/files/indo-us-preso.pdf.Google ScholarGoogle Scholar
  61. Brian Pinkerton. 1994. Finding what people want: Experiences with the web crawler. In International Conference on World Wide Web (WWW).Google ScholarGoogle Scholar
  62. Brian Pinkerton. 2000. WebCrawler: Finding What People Want. Ph.D. Dissertation. University of Washington. Google ScholarGoogle ScholarDigital LibraryDigital Library
  63. Jakub Piskorski, Marcin Sydow, and Dawid Weiss. 2008. Exploring linguistic features for web spam detection: A preliminary study. In Workshop on Adversarial Information Retrieval on the Web. 25--28. Google ScholarGoogle ScholarDigital LibraryDigital Library
  64. Matthew Richardson, Amit Prakash, and Eric Brill. 2006. Beyond Pagerank: Machine learning for static ranking. In International Conference on World Wide Web (WWW). 707--715. Google ScholarGoogle ScholarDigital LibraryDigital Library
  65. Hiroo Saito, Masashi Toyoda, Masaru Kitsuregawa, and Kazuyuki Aihara. 2007. A large-scale study of link spam detection by graph algorithms. In Workshop on Adversarial Information Retrieval on the Web. 45--48. Google ScholarGoogle ScholarDigital LibraryDigital Library
  66. Vladislav Shkapenyuk and Torsten Suel. 2002. Design and implementation of a high-performance distributed web crawler. In IEEE International Conference on Data Engineering (ICDE). 357--368. Google ScholarGoogle ScholarDigital LibraryDigital Library
  67. Amit Signhal. 2012. Breakfast with Google’s Search Team. (Aug. 2012). Retrieved from https://www.youtube.com/watch?v=8a2VmxqFg8A.Google ScholarGoogle Scholar
  68. Aameek Singh, Mudhakar Srivatsa, Ling Liu, and Todd Miller. 2003. Apoidea: A decentralized peer-to-peer architecture for crawling the world wide web. In SIGIR Workshop on Distributed Information Retrieval. 126--142.Google ScholarGoogle Scholar
  69. William Stearns. 2008. SpamAssassin Domain Blacklist. (Aug. 2008). Retrieved from http://www.stearns.org/sa-blacklist/.Google ScholarGoogle Scholar
  70. T. Suel, C. Mathur, J. Wu, J. Zhang, A. Delis, M. Kharrazi, X. Long, and K. Shanmugasundaram. 2003. ODISSEA: A peer-to-peer architecture for scalable web search and information retrieval. In International Workshop on the Web and Databases (WebDB). 67--72.Google ScholarGoogle Scholar
  71. The Stanford WebBase Project. 2008. WebBase Archive. (Sept. 2008). Retrieved from http://dbpubs.stanford.edu:8091/∼testbed/doc2/WebBase/.Google ScholarGoogle Scholar
  72. John Tomlin. 2003. A new paradigm for ranking pages on the world wide web. In International Conference on World Wide Web (WWW). Google ScholarGoogle ScholarDigital LibraryDigital Library
  73. Trystan Upstill, Nick Craswell, and David Hawking. 2003. Predicting fame and fortune: PageRank or indegree. In 8th Australasian Document Computing Symposium.Google ScholarGoogle Scholar
  74. X. Wang, X. Li, and D. Loguinov. 2011. Modeling residual-geometric flow sampling. In IEEE Conference on Computer Communications (INFOCOM). 1808--1816.Google ScholarGoogle Scholar
  75. Steve Webb, James Caverlee, and Calton Pu. 2007. Characterizing web spam using content and HTTP session analysis. In Conference on Email and Anti-Spam (CEAS).Google ScholarGoogle Scholar
  76. Steve Webb, James Caverlee, and Calton Pu. 2008. Predicting web spam with HTTP session information. In International Conference on Information and Knowledge Management (CIKM). 339--348. Google ScholarGoogle ScholarDigital LibraryDigital Library
  77. Baoning Wu and Kumar Chellapilla. 2007. Extracting link spam using biased random walks from spam seed sets. In Workshop on Adversarial Information Retrieval on the Web. 37--44. Google ScholarGoogle ScholarDigital LibraryDigital Library
  78. Baoning Wu and Brian Davison. 2005. Identifying link farm spam pages. In International Conference on World Wide Web (WWW). Google ScholarGoogle ScholarDigital LibraryDigital Library
  79. Baoning Wu, Vinay Goel, and Brian Davison. 2006. Topical trustrank: Using topicality to combat web spam. In International Conference on World Wide Web (WWW). 63--72. Google ScholarGoogle ScholarDigital LibraryDigital Library
  80. J. Wu and K. Aberer. 2004. Using Siterank for decentralized computation of web document ranking. In Adaptive Hypermedia. 265--274.Google ScholarGoogle Scholar
  81. Gui-Rong Xue, Qiang Yang, Hua-Jun Zeng, Yong Yu, and Zheng Chen. 2005. Exploiting the hierarchical structure for link analysis. In ACM SIGIR Conference on Research and Development in Information Retrieval. 186--193. Google ScholarGoogle ScholarDigital LibraryDigital Library
  82. Sandeep Yadav, Ashwath Kumar Krishna Reddy, A. L. Narasimha Reddy, and Supranamaya Ranjan. 2010. Detecting algorithmically generated malicious domain names. In ACM SIGCOMM Internet Measurement Conference (IMC). ACM, 48--61. Google ScholarGoogle ScholarDigital LibraryDigital Library
  83. Yahoo Research Barcelona. 2008. Datasets for Research on Web Spam Detection. (Sept. 2008). Retrieved from http://barcelona.research.yahoo.net/webspam/datasets/.Google ScholarGoogle Scholar
  84. Hongfei Yan, Jianyong Wang, Xiaoming Li, and Lin Guo. 2002. Architectural design and evaluation of an efficient web-crawling system. Journal of Systems and Software 60, 3 (Feb. 2002), 185--193. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Unsupervised Domain Ranking in Large-Scale Web Crawls

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in

      Full Access

      • Published in

        cover image ACM Transactions on the Web
        ACM Transactions on the Web  Volume 12, Issue 4
        November 2018
        215 pages
        ISSN:1559-1131
        EISSN:1559-114X
        DOI:10.1145/3281744
        Issue’s Table of Contents

        Copyright © 2018 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 27 September 2018
        • Accepted: 1 July 2018
        • Revised: 1 May 2018
        • Received: 1 October 2017
        Published in tweb Volume 12, Issue 4

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article
        • Research
        • Refereed
      • Article Metrics

        • Downloads (Last 12 months)14
        • Downloads (Last 6 weeks)1

        Other Metrics

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader