Abstract
With the proliferation of web spam and infinite autogenerated web content, large-scale web crawlers require low-complexity ranking methods to effectively budget their limited resources and allocate bandwidth to reputable sites. In this work, we assume crawls that produce frontiers orders of magnitude larger than RAM, where sorting of pending URLs is infeasible in real time. Under these constraints, the main objective is to quickly compute domain budgets and decide which of them can be massively crawled. Those ranked at the top of the list receive aggressive crawling allowances, while all other domains are visited at some small default rate. To shed light on Internet-wide spam avoidance, we study topology-based ranking algorithms on domain-level graphs from the two largest academic crawls: a 6.3B-page IRLbot dataset and a 1B-page ClueWeb09 exploration. We first propose a new methodology for comparing the various rankings and then show that in-degree BFS-based techniques decisively outperform classic PageRank-style methods, including TrustRank. However, since BFS requires several orders of magnitude higher overhead and is generally infeasible for real-time use, we propose a fast, accurate, and scalable estimation method called TSE that can achieve much better crawl prioritization in practice. It is especially beneficial in applications with limited hardware resources.
- Jacob Abernethy, Olivier Chapelle, and Carlos Castillo. 2008. Web spam identification through content and hyperlinks. In Workshop on Adversarial Information Retrieval on the Web. 41--44. Google ScholarDigital Library
- Serge Abiteboul, Mihai Preda, and Gregory Cobena. 2003. Adaptive on-line page importance computation. In International Conference on World Wide Web (WWW). 280--290. Google ScholarDigital Library
- Sarker Tanzir Ahmed and Dmitri Loguinov. 2015. Around the web in six weeks: Documenting a large-scale crawl. In IEEE Conference on Computer Communications (INFOCOM). 1598--1606.Google ScholarCross Ref
- alexa.com. Retrieved in 2015 from http://www.alexa.com/.Google Scholar
- Jesse Alpert and Nissan Hajaj. 2008. We Knew the Web Was Big... (July 2008). Retrieved from http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html.Google Scholar
- Shatlyk Ashyralyyev, B. Barla Cambazoglu, and Cevdet Aykanat. 2013. Incorporating the surfing behavior of web users into pagerank. In ACM International Conference on Information and Knowledge Management (CIKM). 2351--2356. Google ScholarDigital Library
- Ricardo Baeza-Yates, Carlos Castillo, Mauricio Marin, and Andrea Rodriguez. 2005. Crawling a country: Better strategies than breadth-first for web page ordering. In International Conference on World Wide Web (WWW). Google ScholarDigital Library
- Xiao Bai, B. Barla Cambazoglu, and Flavio P. Junqueira. 2011. Discovering URLs through user feedback. In ACM International Conference on Information and Knowledge Management (CIKM). 77--86. Google ScholarDigital Library
- Luca Becchetti, Carlos Castillo, Debora Donato, Stefano Leonardi, and Ricardo Baeza-Yates. 2006. Link-based characterization and detection of web spam. In Workshop on Adversarial Information Retrieval on the Web.Google Scholar
- Luca Becchetti, Carlos Castillo, Debora Donato, Stefano Leonardi, and Ricardo Baeza-Yates. 2006. Using rank propagation and probabilistic counting for link-based spam detection. In Workshop on Web Mining and Web Usage Analysis (WebKDD).Google Scholar
- Andras Benczur, Karoly Csalogany, and Tamas Sarlos. 2006. Link-based similarity search to fight web spam. In Workshop on Adversarial Information Retrieval on the Web.Google Scholar
- Andras A. Benczur, Karoly Csalogany, Tamas Sarlos, and Mate Uher. 2005. Spamrank -- Fully automatic link spam detection. In Workshop on Adversarial Information Retrieval on the Web.Google Scholar
- Paolo Boldi, Bruno Codenotti, Massimo Santini, and Sebastiano Vigna. 2004. UbiCrawler: A scalable fully distributed web crawler. Software: Practice 8 Experience 34, 8 (July 2004), 711--726. Google ScholarDigital Library
- Sergey Brin and Lawrence Page. 1998. The anatomy of a large-scale hypertextual web search engine. In International Conference on World Wide Web (WWW). 107--117. Google ScholarDigital Library
- Carlos Castillo, Mauricio Marin, Andrea Rodriguez, and Ricardo Baeza-Yates. 2004. Scheduling algorithms for web crawling. In Latin American Web Conference. Google ScholarDigital Library
- James Caverlee and Ling Liu. 2007. Countering web spam with credibility-based link analysis. In ACM Symposium on Principles of Distributed Computing (PODC). 157--166. Google ScholarDigital Library
- James Caverlee, Steve Webb, and Ling Liu. 2007. Spam-resilient web rankings via influence throttling. In International Parallel and Distributed Processing Symposium (IPDPS).Google ScholarCross Ref
- Anne Chao and Shen-Ming Lee. 1992. Estimating the number of classes via sample coverage. Journal of the American Statistical Association 87, 417 (March 1992), 210--217.Google ScholarCross Ref
- Yen-Yu Chen, Qingqing Gan, and Torsten Suel. 2002. I/O-efficient techniques for computing pagerank. In International Conference on Information and Knowledge Management (CIKM). Google ScholarDigital Library
- Yen-Yu Chen, Qingqing Gan, and Torsten Suel. 2004. Local methods for estimating pagerank values. In International Conference on Information and Knowledge Management (CIKM). Google ScholarDigital Library
- Junghoo Cho, Hector Garcia-Molina, Taher Haveliwala, Wang Lam, Andreas Paepcke, and Sriram Raghavan Gary Wesley. 2006. Stanford webbase components and applications. ACM Transactions on Internet Technology 6, 2 (May 2006), 153--186. Google ScholarDigital Library
- Junghoo Cho, Hector Garcia-Molina, and Lawrence Page. 1998. Efficient crawling through URL ordering. In International Conference on World Wide Web (WWW). 161--172. Google ScholarDigital Library
- Young-Joo Chung, Masashi Toyoda, and Masaru Kitsuregawa. 2009. A study of link farm distribution and evolution using a time series of web snapshots. In Workshop on Adversarial Information Retrieval on the Web. 9--16. Google ScholarDigital Library
- ClueWeb09 Dataset. Retrieved in 2015 from http://www.lemurproject.org/clueweb09/.Google Scholar
- CLuE cluster. Retrieved from http://www.nsf.gov/cise/clue/index.jsp.Google Scholar
- ClueWeb09 documentation. Retrieved in 2015 from http://boston.lti.cs.cmu.edu/Data/web08-bst/planning.html.Google Scholar
- Isabel Drost and Tobias Scheffer. 2005. Thwarting the nigritude ultramaring: Learning to identify link spam. In European Conference on Machine Learning (ECML). Google ScholarDigital Library
- N. Duffield, C. Lund, and M. Thorup. 2003. Estimating flow distributions from sampled flow statistics. In ACM SIGCOMM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications. 325--336. Google ScholarDigital Library
- Jenny Edwards, Kevin McCurley, and John Tomlin. 2001. An adaptive model for optimizing performance of an incremental web crawler. In International Conference on World Wide Web (WWW). 106--113. Google ScholarDigital Library
- David Eichmann. 1994. The RBSE spider -- Balancing effective search against web load. In International Conference on World Wide Web (WWW).Google ScholarCross Ref
- Nadav Eiron, Kevin S. McCurley, and John A. Tomlin. 2004. Ranking the web frontier. In International Conference on World Wide Web (WWW). 309--318. Google ScholarDigital Library
- Guang Feng, Tie-Yan Liu, Ying Wang, Ying Bao, Zhiming Ma, Xu-Dong Zhang, and Wei-Ying Ma. 2006. AggregateRank: Bringing order to web sites. In ACM SIGIR Conference on Research and Development in Information Retrieval. 75--82. Google ScholarDigital Library
- Dennis Fetterly, Mark Manasse, and Marc Najork. 2004. Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages. In International Workshop on the Web and Databases (WebDB). 1--6. Google ScholarDigital Library
- Google Toolbar. Retrieved in 2008 from http://toolbar.google.com.Google Scholar
- IRLbot PLD graph. Retrieved in 2018 from http://irl.cs.tamu.edu/projects/web/.Google Scholar
- D. Gruhl, L. Chavet, D. Gibson, J. Meyer, P. Pattanayak, A. Tomkins, and J. Zien. 2004. How to build a webfountain: An architecture for very large-scale text analytics. IBM Systems Journal 43, 1 (2004), 64--77. Google ScholarDigital Library
- Zoltán Gyöngyi, Pavel Berkhin, Hector Garcia-Molina, and Jan Pedersen. 2006. Link spam detection based on mass estimation. In International Conference on Very Large Data Bases (VLDB). 439--450. Google ScholarDigital Library
- Zoltán Gyöngyi and Hector Garcia-Molina. 2005. Link spam alliances. In International Conference on Very Large Data Bases (VLDB). 517--528. Google ScholarDigital Library
- Zoltán Gyöngyi and Hector Garcia-Molina. 2005. Web spam taxonomy. In Workshop on Adversarial Information Retrieval on the Web. 39--47.Google Scholar
- Zoltán Gyöngyi, Hector Garcia-Molina, and Jan Pedersen. 2004. Combating web spam with trustRank. In International Conference on Very Large Data Bases (VLDB). 576--587. Google ScholarDigital Library
- Younes Hafri and Chabane Djeraba. 2004. High performance crawling system. In ACM International Workshop on Multimedia Information Retrieval (MIR). 299--306. Google ScholarDigital Library
- Allan Heydon and Marc Najork. 1999. Mercator: A scalable, extensible web crawler. World Wide Web 2, 4 (Dec. 1999), 219--229. Google ScholarDigital Library
- Jun Hirai, Sriram Raghavan, Hector Garcia-Molina, and Andreas Paepcke. 2000. WebBase: A repository of web pages. In International Conference on World Wide Web (WWW). 277--293. Google ScholarDigital Library
- Internet Archive. Retrieved in 2018 from http://archive.org/.Google Scholar
- Melody Ivory and Marti Hearst. 2002. Statistical profiles of highly-rated web sites. In Conference on Human Factors in Computing Systems (CHI). 367--374. Google ScholarDigital Library
- Kasom Koht-arsa and Surasak Sanguanpong. 2002. High performance large scale web spider architecture. In IEEE International Symposium on Communications and Information Technologies (ISCIT).Google Scholar
- Laboratory for Web Algorithmics. Retrieved in 2018 from http://law.dsi.unimi.it/.Google Scholar
- Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, and Dmitri Loguinov. 2009. IRLbot: Scaling to 6 billion pages and beyond. ACM Transactions on the Web 3, 3 (June 2009), 1--34. Google ScholarDigital Library
- Ronny Lempel and Shlomo Moran. 2001. SALSA: The stochastic approach for link-structure analysis. ACM Transactions on Information Systems 19, 2 (Apr. 2001), 131--160. Google ScholarDigital Library
- Yanhong Li. 1998. Toward a qualitative search engine. IEEE Internet Computing 2, 4 (July 1998), 24--29. Google ScholarDigital Library
- Yuting Liu, Bin Gao, Tie-Yan Liu, Ying Zhang, Zhiming Ma, Shuyuan He, and Hang Li. 2008. BrowseRank: Letting web users vote for page importance. In ACM SIGIR Conference on Research and Development in Information Retrieval. 451--458. Google ScholarDigital Library
- Oliver A. McBryan. 1994. GENVL and WWWW: Tools for taming the web. In International Conference on World Wide Web (WWW).Google ScholarCross Ref
- Gilad Mishne, David Carmel, and Ronny Lempel. 2005. Blocking blog spam with language model disagreement. In Workshop on Adversarial Information Retrieval on the Web.Google Scholar
- Mozilla. 2013. Public Suffix List. (May 2013). Retrieved from http://publicsuffix.org/.Google Scholar
- Marc Najork and Allan Heydon. 2001. High-Performance Web Crawling. Technical Report 173. Compaq SRC. Retrieved from http://www.hpl.hp.com/techreports/Compaq-DEC/SRC-RR-173.pdf.Google Scholar
- Marc Najork and Janet L. Wiener. 2001. Breadth-first search crawling yields high-quality pages. In International Conference on World Wide Web (WWW). 114--118. Google ScholarDigital Library
- Zaiqing Nie, Yuanzhi Zhang, Ji-Rong Wen, and Wei-Ying Ma. 2005. Object-level ranking: Bringing order to web objects. In International Conference on World Wide Web (WWW). Google ScholarDigital Library
- Alexandros Ntoulas, Marc Najork, Mark Manasse, and Dennis Fetterly. 2006. Detecting spam web pages through content analysis. In International Conference on World Wide Web (WWW). 83--92. Google ScholarDigital Library
- Nutch. Retrieved in 2018 from http://nutch.apache.org/.Google Scholar
- Jothi Padmanabhan. 2010. Introduction to Grid Computing via Map-Reduce 8 Hadoop. (Dec. 2010). Retrieved from http://internationalnetworking.iu.edu/sites/internationalnetworking.iu.edu/files/indo-us-preso.pdf.Google Scholar
- Brian Pinkerton. 1994. Finding what people want: Experiences with the web crawler. In International Conference on World Wide Web (WWW).Google Scholar
- Brian Pinkerton. 2000. WebCrawler: Finding What People Want. Ph.D. Dissertation. University of Washington. Google ScholarDigital Library
- Jakub Piskorski, Marcin Sydow, and Dawid Weiss. 2008. Exploring linguistic features for web spam detection: A preliminary study. In Workshop on Adversarial Information Retrieval on the Web. 25--28. Google ScholarDigital Library
- Matthew Richardson, Amit Prakash, and Eric Brill. 2006. Beyond Pagerank: Machine learning for static ranking. In International Conference on World Wide Web (WWW). 707--715. Google ScholarDigital Library
- Hiroo Saito, Masashi Toyoda, Masaru Kitsuregawa, and Kazuyuki Aihara. 2007. A large-scale study of link spam detection by graph algorithms. In Workshop on Adversarial Information Retrieval on the Web. 45--48. Google ScholarDigital Library
- Vladislav Shkapenyuk and Torsten Suel. 2002. Design and implementation of a high-performance distributed web crawler. In IEEE International Conference on Data Engineering (ICDE). 357--368. Google ScholarDigital Library
- Amit Signhal. 2012. Breakfast with Google’s Search Team. (Aug. 2012). Retrieved from https://www.youtube.com/watch?v=8a2VmxqFg8A.Google Scholar
- Aameek Singh, Mudhakar Srivatsa, Ling Liu, and Todd Miller. 2003. Apoidea: A decentralized peer-to-peer architecture for crawling the world wide web. In SIGIR Workshop on Distributed Information Retrieval. 126--142.Google Scholar
- William Stearns. 2008. SpamAssassin Domain Blacklist. (Aug. 2008). Retrieved from http://www.stearns.org/sa-blacklist/.Google Scholar
- T. Suel, C. Mathur, J. Wu, J. Zhang, A. Delis, M. Kharrazi, X. Long, and K. Shanmugasundaram. 2003. ODISSEA: A peer-to-peer architecture for scalable web search and information retrieval. In International Workshop on the Web and Databases (WebDB). 67--72.Google Scholar
- The Stanford WebBase Project. 2008. WebBase Archive. (Sept. 2008). Retrieved from http://dbpubs.stanford.edu:8091/∼testbed/doc2/WebBase/.Google Scholar
- John Tomlin. 2003. A new paradigm for ranking pages on the world wide web. In International Conference on World Wide Web (WWW). Google ScholarDigital Library
- Trystan Upstill, Nick Craswell, and David Hawking. 2003. Predicting fame and fortune: PageRank or indegree. In 8th Australasian Document Computing Symposium.Google Scholar
- X. Wang, X. Li, and D. Loguinov. 2011. Modeling residual-geometric flow sampling. In IEEE Conference on Computer Communications (INFOCOM). 1808--1816.Google Scholar
- Steve Webb, James Caverlee, and Calton Pu. 2007. Characterizing web spam using content and HTTP session analysis. In Conference on Email and Anti-Spam (CEAS).Google Scholar
- Steve Webb, James Caverlee, and Calton Pu. 2008. Predicting web spam with HTTP session information. In International Conference on Information and Knowledge Management (CIKM). 339--348. Google ScholarDigital Library
- Baoning Wu and Kumar Chellapilla. 2007. Extracting link spam using biased random walks from spam seed sets. In Workshop on Adversarial Information Retrieval on the Web. 37--44. Google ScholarDigital Library
- Baoning Wu and Brian Davison. 2005. Identifying link farm spam pages. In International Conference on World Wide Web (WWW). Google ScholarDigital Library
- Baoning Wu, Vinay Goel, and Brian Davison. 2006. Topical trustrank: Using topicality to combat web spam. In International Conference on World Wide Web (WWW). 63--72. Google ScholarDigital Library
- J. Wu and K. Aberer. 2004. Using Siterank for decentralized computation of web document ranking. In Adaptive Hypermedia. 265--274.Google Scholar
- Gui-Rong Xue, Qiang Yang, Hua-Jun Zeng, Yong Yu, and Zheng Chen. 2005. Exploiting the hierarchical structure for link analysis. In ACM SIGIR Conference on Research and Development in Information Retrieval. 186--193. Google ScholarDigital Library
- Sandeep Yadav, Ashwath Kumar Krishna Reddy, A. L. Narasimha Reddy, and Supranamaya Ranjan. 2010. Detecting algorithmically generated malicious domain names. In ACM SIGCOMM Internet Measurement Conference (IMC). ACM, 48--61. Google ScholarDigital Library
- Yahoo Research Barcelona. 2008. Datasets for Research on Web Spam Detection. (Sept. 2008). Retrieved from http://barcelona.research.yahoo.net/webspam/datasets/.Google Scholar
- Hongfei Yan, Jianyong Wang, Xiaoming Li, and Lin Guo. 2002. Architectural design and evaluation of an efficient web-crawling system. Journal of Systems and Software 60, 3 (Feb. 2002), 185--193. Google ScholarDigital Library
Index Terms
- Unsupervised Domain Ranking in Large-Scale Web Crawls
Recommendations
The Representativeness of Automated Web Crawls as a Surrogate for Human Browsing
WWW '20: Proceedings of The Web Conference 2020Large-scale Web crawls have emerged as the state of the art for studying characteristics of the Web. In particular, they are a core tool for online tracking research. Web crawling is an attractive approach to data collection, as crawls can be run at ...
Ranking web sites using domain ontology concepts
Many web search engines retrieve enormous amounts of irrelevant information in answer to users' queries. The semantic web provides a promising approach to improve search operation. For specific domains, ontologies can capture concepts to help machines ...
Current challenges in web crawling
ICWE'13: Proceedings of the 13th international conference on Web EngineeringWeb crawling, a process of collecting web pages in an automated manner, is the primary and ubiquitous operation used by a large number of web systems and agents starting from a simple program for website backup to a major web search engine. Due to an ...
Comments