research-article

Unsupervised Domain Ranking in Large-Scale Web Crawls

Authors:
Yi Cui

Texas A8M University, USA

Texas A8M University, USA
View Profile

,
Clint Sparkman

United States Air Force Academy, USA

United States Air Force Academy, USA
View Profile

,
Hsin-Tsang Lee

Microsoft Corp., USA

Microsoft Corp., USA
View Profile

,
Dmitri Loguinov

Texas A8M University, USA

Texas A8M University, USA
View Profile

Authors Info & Claims

ACM Transactions on the Web Volume 12 Issue 4Article No.: 26pp 1–29https://doi.org/10.1145/3182180

Published:27 September 2018Publication History

ACM Transactions on the Web

Abstract

With the proliferation of web spam and infinite autogenerated web content, large-scale web crawlers require low-complexity ranking methods to effectively budget their limited resources and allocate bandwidth to reputable sites. In this work, we assume crawls that produce frontiers orders of magnitude larger than RAM, where sorting of pending URLs is infeasible in real time. Under these constraints, the main objective is to quickly compute domain budgets and decide which of them can be massively crawled. Those ranked at the top of the list receive aggressive crawling allowances, while all other domains are visited at some small default rate. To shed light on Internet-wide spam avoidance, we study topology-based ranking algorithms on domain-level graphs from the two largest academic crawls: a 6.3B-page IRLbot dataset and a 1B-page ClueWeb09 exploration. We first propose a new methodology for comparing the various rankings and then show that in-degree BFS-based techniques decisively outperform classic PageRank-style methods, including TrustRank. However, since BFS requires several orders of magnitude higher overhead and is generally infeasible for real-time use, we propose a fast, accurate, and scalable estimation method called TSE that can achieve much better crawl prioritization in practice. It is especially beneficial in applications with limited hardware resources.

References

Jacob Abernethy, Olivier Chapelle, and Carlos Castillo. 2008. Web spam identification through content and hyperlinks. In Workshop on Adversarial Information Retrieval on the Web. 41--44. Google ScholarDigital Library
Serge Abiteboul, Mihai Preda, and Gregory Cobena. 2003. Adaptive on-line page importance computation. In International Conference on World Wide Web (WWW). 280--290. Google ScholarDigital Library
Sarker Tanzir Ahmed and Dmitri Loguinov. 2015. Around the web in six weeks: Documenting a large-scale crawl. In IEEE Conference on Computer Communications (INFOCOM). 1598--1606.Google ScholarCross Ref
alexa.com. Retrieved in 2015 from http://www.alexa.com/.Google Scholar
Jesse Alpert and Nissan Hajaj. 2008. We Knew the Web Was Big... (July 2008). Retrieved from http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html.Google Scholar
Shatlyk Ashyralyyev, B. Barla Cambazoglu, and Cevdet Aykanat. 2013. Incorporating the surfing behavior of web users into pagerank. In ACM International Conference on Information and Knowledge Management (CIKM). 2351--2356. Google ScholarDigital Library
Ricardo Baeza-Yates, Carlos Castillo, Mauricio Marin, and Andrea Rodriguez. 2005. Crawling a country: Better strategies than breadth-first for web page ordering. In International Conference on World Wide Web (WWW). Google ScholarDigital Library
Xiao Bai, B. Barla Cambazoglu, and Flavio P. Junqueira. 2011. Discovering URLs through user feedback. In ACM International Conference on Information and Knowledge Management (CIKM). 77--86. Google ScholarDigital Library
Luca Becchetti, Carlos Castillo, Debora Donato, Stefano Leonardi, and Ricardo Baeza-Yates. 2006. Link-based characterization and detection of web spam. In Workshop on Adversarial Information Retrieval on the Web.Google Scholar
Luca Becchetti, Carlos Castillo, Debora Donato, Stefano Leonardi, and Ricardo Baeza-Yates. 2006. Using rank propagation and probabilistic counting for link-based spam detection. In Workshop on Web Mining and Web Usage Analysis (WebKDD).Google Scholar
Andras Benczur, Karoly Csalogany, and Tamas Sarlos. 2006. Link-based similarity search to fight web spam. In Workshop on Adversarial Information Retrieval on the Web.Google Scholar
Andras A. Benczur, Karoly Csalogany, Tamas Sarlos, and Mate Uher. 2005. Spamrank -- Fully automatic link spam detection. In Workshop on Adversarial Information Retrieval on the Web.Google Scholar
Paolo Boldi, Bruno Codenotti, Massimo Santini, and Sebastiano Vigna. 2004. UbiCrawler: A scalable fully distributed web crawler. Software: Practice 8 Experience 34, 8 (July 2004), 711--726. Google ScholarDigital Library
Sergey Brin and Lawrence Page. 1998. The anatomy of a large-scale hypertextual web search engine. In International Conference on World Wide Web (WWW). 107--117. Google ScholarDigital Library
Carlos Castillo, Mauricio Marin, Andrea Rodriguez, and Ricardo Baeza-Yates. 2004. Scheduling algorithms for web crawling. In Latin American Web Conference. Google ScholarDigital Library
James Caverlee and Ling Liu. 2007. Countering web spam with credibility-based link analysis. In ACM Symposium on Principles of Distributed Computing (PODC). 157--166. Google ScholarDigital Library
James Caverlee, Steve Webb, and Ling Liu. 2007. Spam-resilient web rankings via influence throttling. In International Parallel and Distributed Processing Symposium (IPDPS).Google ScholarCross Ref
Anne Chao and Shen-Ming Lee. 1992. Estimating the number of classes via sample coverage. Journal of the American Statistical Association 87, 417 (March 1992), 210--217.Google ScholarCross Ref
Yen-Yu Chen, Qingqing Gan, and Torsten Suel. 2002. I/O-efficient techniques for computing pagerank. In International Conference on Information and Knowledge Management (CIKM). Google ScholarDigital Library
Yen-Yu Chen, Qingqing Gan, and Torsten Suel. 2004. Local methods for estimating pagerank values. In International Conference on Information and Knowledge Management (CIKM). Google ScholarDigital Library
Junghoo Cho, Hector Garcia-Molina, Taher Haveliwala, Wang Lam, Andreas Paepcke, and Sriram Raghavan Gary Wesley. 2006. Stanford webbase components and applications. ACM Transactions on Internet Technology 6, 2 (May 2006), 153--186. Google ScholarDigital Library
Junghoo Cho, Hector Garcia-Molina, and Lawrence Page. 1998. Efficient crawling through URL ordering. In International Conference on World Wide Web (WWW). 161--172. Google ScholarDigital Library
Young-Joo Chung, Masashi Toyoda, and Masaru Kitsuregawa. 2009. A study of link farm distribution and evolution using a time series of web snapshots. In Workshop on Adversarial Information Retrieval on the Web. 9--16. Google ScholarDigital Library
ClueWeb09 Dataset. Retrieved in 2015 from http://www.lemurproject.org/clueweb09/.Google Scholar
CLuE cluster. Retrieved from http://www.nsf.gov/cise/clue/index.jsp.Google Scholar
ClueWeb09 documentation. Retrieved in 2015 from http://boston.lti.cs.cmu.edu/Data/web08-bst/planning.html.Google Scholar
Isabel Drost and Tobias Scheffer. 2005. Thwarting the nigritude ultramaring: Learning to identify link spam. In European Conference on Machine Learning (ECML). Google ScholarDigital Library
N. Duffield, C. Lund, and M. Thorup. 2003. Estimating flow distributions from sampled flow statistics. In ACM SIGCOMM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications. 325--336. Google ScholarDigital Library
Jenny Edwards, Kevin McCurley, and John Tomlin. 2001. An adaptive model for optimizing performance of an incremental web crawler. In International Conference on World Wide Web (WWW). 106--113. Google ScholarDigital Library
David Eichmann. 1994. The RBSE spider -- Balancing effective search against web load. In International Conference on World Wide Web (WWW).Google ScholarCross Ref
Nadav Eiron, Kevin S. McCurley, and John A. Tomlin. 2004. Ranking the web frontier. In International Conference on World Wide Web (WWW). 309--318. Google ScholarDigital Library
Guang Feng, Tie-Yan Liu, Ying Wang, Ying Bao, Zhiming Ma, Xu-Dong Zhang, and Wei-Ying Ma. 2006. AggregateRank: Bringing order to web sites. In ACM SIGIR Conference on Research and Development in Information Retrieval. 75--82. Google ScholarDigital Library
Dennis Fetterly, Mark Manasse, and Marc Najork. 2004. Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages. In International Workshop on the Web and Databases (WebDB). 1--6. Google ScholarDigital Library
Google Toolbar. Retrieved in 2008 from http://toolbar.google.com.Google Scholar
IRLbot PLD graph. Retrieved in 2018 from http://irl.cs.tamu.edu/projects/web/.Google Scholar
D. Gruhl, L. Chavet, D. Gibson, J. Meyer, P. Pattanayak, A. Tomkins, and J. Zien. 2004. How to build a webfountain: An architecture for very large-scale text analytics. IBM Systems Journal 43, 1 (2004), 64--77. Google ScholarDigital Library
Zoltán Gyöngyi, Pavel Berkhin, Hector Garcia-Molina, and Jan Pedersen. 2006. Link spam detection based on mass estimation. In International Conference on Very Large Data Bases (VLDB). 439--450. Google ScholarDigital Library
Zoltán Gyöngyi and Hector Garcia-Molina. 2005. Link spam alliances. In International Conference on Very Large Data Bases (VLDB). 517--528. Google ScholarDigital Library
Zoltán Gyöngyi and Hector Garcia-Molina. 2005. Web spam taxonomy. In Workshop on Adversarial Information Retrieval on the Web. 39--47.Google Scholar
Zoltán Gyöngyi, Hector Garcia-Molina, and Jan Pedersen. 2004. Combating web spam with trustRank. In International Conference on Very Large Data Bases (VLDB). 576--587. Google ScholarDigital Library
Younes Hafri and Chabane Djeraba. 2004. High performance crawling system. In ACM International Workshop on Multimedia Information Retrieval (MIR). 299--306. Google ScholarDigital Library
Allan Heydon and Marc Najork. 1999. Mercator: A scalable, extensible web crawler. World Wide Web 2, 4 (Dec. 1999), 219--229. Google ScholarDigital Library
Jun Hirai, Sriram Raghavan, Hector Garcia-Molina, and Andreas Paepcke. 2000. WebBase: A repository of web pages. In International Conference on World Wide Web (WWW). 277--293. Google ScholarDigital Library
Internet Archive. Retrieved in 2018 from http://archive.org/.Google Scholar
Melody Ivory and Marti Hearst. 2002. Statistical profiles of highly-rated web sites. In Conference on Human Factors in Computing Systems (CHI). 367--374. Google ScholarDigital Library
Kasom Koht-arsa and Surasak Sanguanpong. 2002. High performance large scale web spider architecture. In IEEE International Symposium on Communications and Information Technologies (ISCIT).Google Scholar
Laboratory for Web Algorithmics. Retrieved in 2018 from http://law.dsi.unimi.it/.Google Scholar
Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, and Dmitri Loguinov. 2009. IRLbot: Scaling to 6 billion pages and beyond. ACM Transactions on the Web 3, 3 (June 2009), 1--34. Google ScholarDigital Library
Ronny Lempel and Shlomo Moran. 2001. SALSA: The stochastic approach for link-structure analysis. ACM Transactions on Information Systems 19, 2 (Apr. 2001), 131--160. Google ScholarDigital Library
Yanhong Li. 1998. Toward a qualitative search engine. IEEE Internet Computing 2, 4 (July 1998), 24--29. Google ScholarDigital Library
Yuting Liu, Bin Gao, Tie-Yan Liu, Ying Zhang, Zhiming Ma, Shuyuan He, and Hang Li. 2008. BrowseRank: Letting web users vote for page importance. In ACM SIGIR Conference on Research and Development in Information Retrieval. 451--458. Google ScholarDigital Library
Oliver A. McBryan. 1994. GENVL and WWWW: Tools for taming the web. In International Conference on World Wide Web (WWW).Google ScholarCross Ref
Gilad Mishne, David Carmel, and Ronny Lempel. 2005. Blocking blog spam with language model disagreement. In Workshop on Adversarial Information Retrieval on the Web.Google Scholar
Mozilla. 2013. Public Suffix List. (May 2013). Retrieved from http://publicsuffix.org/.Google Scholar
Marc Najork and Allan Heydon. 2001. High-Performance Web Crawling. Technical Report 173. Compaq SRC. Retrieved from http://www.hpl.hp.com/techreports/Compaq-DEC/SRC-RR-173.pdf.Google Scholar
Marc Najork and Janet L. Wiener. 2001. Breadth-first search crawling yields high-quality pages. In International Conference on World Wide Web (WWW). 114--118. Google ScholarDigital Library
Zaiqing Nie, Yuanzhi Zhang, Ji-Rong Wen, and Wei-Ying Ma. 2005. Object-level ranking: Bringing order to web objects. In International Conference on World Wide Web (WWW). Google ScholarDigital Library
Alexandros Ntoulas, Marc Najork, Mark Manasse, and Dennis Fetterly. 2006. Detecting spam web pages through content analysis. In International Conference on World Wide Web (WWW). 83--92. Google ScholarDigital Library
Nutch. Retrieved in 2018 from http://nutch.apache.org/.Google Scholar
Jothi Padmanabhan. 2010. Introduction to Grid Computing via Map-Reduce 8 Hadoop. (Dec. 2010). Retrieved from http://internationalnetworking.iu.edu/sites/internationalnetworking.iu.edu/files/indo-us-preso.pdf.Google Scholar
Brian Pinkerton. 1994. Finding what people want: Experiences with the web crawler. In International Conference on World Wide Web (WWW).Google Scholar
Brian Pinkerton. 2000. WebCrawler: Finding What People Want. Ph.D. Dissertation. University of Washington. Google ScholarDigital Library
Jakub Piskorski, Marcin Sydow, and Dawid Weiss. 2008. Exploring linguistic features for web spam detection: A preliminary study. In Workshop on Adversarial Information Retrieval on the Web. 25--28. Google ScholarDigital Library
Matthew Richardson, Amit Prakash, and Eric Brill. 2006. Beyond Pagerank: Machine learning for static ranking. In International Conference on World Wide Web (WWW). 707--715. Google ScholarDigital Library
Hiroo Saito, Masashi Toyoda, Masaru Kitsuregawa, and Kazuyuki Aihara. 2007. A large-scale study of link spam detection by graph algorithms. In Workshop on Adversarial Information Retrieval on the Web. 45--48. Google ScholarDigital Library
Vladislav Shkapenyuk and Torsten Suel. 2002. Design and implementation of a high-performance distributed web crawler. In IEEE International Conference on Data Engineering (ICDE). 357--368. Google ScholarDigital Library
Amit Signhal. 2012. Breakfast with Google’s Search Team. (Aug. 2012). Retrieved from https://www.youtube.com/watch?v=8a2VmxqFg8A.Google Scholar
Aameek Singh, Mudhakar Srivatsa, Ling Liu, and Todd Miller. 2003. Apoidea: A decentralized peer-to-peer architecture for crawling the world wide web. In SIGIR Workshop on Distributed Information Retrieval. 126--142.Google Scholar
William Stearns. 2008. SpamAssassin Domain Blacklist. (Aug. 2008). Retrieved from http://www.stearns.org/sa-blacklist/.Google Scholar
T. Suel, C. Mathur, J. Wu, J. Zhang, A. Delis, M. Kharrazi, X. Long, and K. Shanmugasundaram. 2003. ODISSEA: A peer-to-peer architecture for scalable web search and information retrieval. In International Workshop on the Web and Databases (WebDB). 67--72.Google Scholar
The Stanford WebBase Project. 2008. WebBase Archive. (Sept. 2008). Retrieved from http://dbpubs.stanford.edu:8091/&sim;testbed/doc2/WebBase/.Google Scholar
John Tomlin. 2003. A new paradigm for ranking pages on the world wide web. In International Conference on World Wide Web (WWW). Google ScholarDigital Library
Trystan Upstill, Nick Craswell, and David Hawking. 2003. Predicting fame and fortune: PageRank or indegree. In 8th Australasian Document Computing Symposium.Google Scholar
X. Wang, X. Li, and D. Loguinov. 2011. Modeling residual-geometric flow sampling. In IEEE Conference on Computer Communications (INFOCOM). 1808--1816.Google Scholar
Steve Webb, James Caverlee, and Calton Pu. 2007. Characterizing web spam using content and HTTP session analysis. In Conference on Email and Anti-Spam (CEAS).Google Scholar
Steve Webb, James Caverlee, and Calton Pu. 2008. Predicting web spam with HTTP session information. In International Conference on Information and Knowledge Management (CIKM). 339--348. Google ScholarDigital Library
Baoning Wu and Kumar Chellapilla. 2007. Extracting link spam using biased random walks from spam seed sets. In Workshop on Adversarial Information Retrieval on the Web. 37--44. Google ScholarDigital Library
Baoning Wu and Brian Davison. 2005. Identifying link farm spam pages. In International Conference on World Wide Web (WWW). Google ScholarDigital Library
Baoning Wu, Vinay Goel, and Brian Davison. 2006. Topical trustrank: Using topicality to combat web spam. In International Conference on World Wide Web (WWW). 63--72. Google ScholarDigital Library
J. Wu and K. Aberer. 2004. Using Siterank for decentralized computation of web document ranking. In Adaptive Hypermedia. 265--274.Google Scholar
Gui-Rong Xue, Qiang Yang, Hua-Jun Zeng, Yong Yu, and Zheng Chen. 2005. Exploiting the hierarchical structure for link analysis. In ACM SIGIR Conference on Research and Development in Information Retrieval. 186--193. Google ScholarDigital Library
Sandeep Yadav, Ashwath Kumar Krishna Reddy, A. L. Narasimha Reddy, and Supranamaya Ranjan. 2010. Detecting algorithmically generated malicious domain names. In ACM SIGCOMM Internet Measurement Conference (IMC). ACM, 48--61. Google ScholarDigital Library
Yahoo Research Barcelona. 2008. Datasets for Research on Web Spam Detection. (Sept. 2008). Retrieved from http://barcelona.research.yahoo.net/webspam/datasets/.Google Scholar
Hongfei Yan, Jianyong Wang, Xiaoming Li, and Lin Guo. 2002. Architectural design and evaluation of an efficient web-crawling system. Journal of Systems and Software 60, 3 (Feb. 2002), 185--193. Google ScholarDigital Library

Index Terms

Unsupervised Domain Ranking in Large-Scale Web Crawls
1. Information systems
  1. World Wide Web
    1. Web searching and information discovery
      1. Web search engines
        Page and site ranking
        Web crawling

Recommendations

The Representativeness of Automated Web Crawls as a Surrogate for Human Browsing
WWW '20: Proceedings of The Web Conference 2020

Large-scale Web crawls have emerged as the state of the art for studying characteristics of the Web. In particular, they are a core tool for online tracking research. Web crawling is an attractive approach to data collection, as crawls can be run at ...
Read More
Ranking web sites using domain ontology concepts

Many web search engines retrieve enormous amounts of irrelevant information in answer to users' queries. The semantic web provides a promising approach to improve search operation. For specific domains, ontologies can capture concepts to help machines ...
Read More
Current challenges in web crawling
ICWE'13: Proceedings of the 13th international conference on Web Engineering

Web crawling, a process of collecting web pages in an automated manner, is the primary and ubiquitous operation used by a large number of web systems and agents starting from a simple program for website backup to a major web search engine. Due to an ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
ACM Transactions on the Web Volume 12, Issue 4
November 2018
215 pages
ISSN:1559-1131
EISSN:1559-114X
DOI:10.1145/3281744
Editor:
Brian D. Davison
Lehigh University, USA
Issue’s Table of Contents
Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 27 September 2018
- Accepted: 1 July 2018
- Revised: 1 May 2018
- Received: 1 October 2017
Published in tweb Volume 12, Issue 4

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Web crawling
frontier prioritization
ranking
Qualifiers
- research-article
- Research
- Refereed
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 0
  Total Citations
  View Citations
- 229
  Total Downloads
- Downloads (Last 12 months)14
- Downloads (Last 6 weeks)1
Other Metrics
View Author Metrics
Cited By
This publication has not been cited yet

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Unsupervised Domain Ranking in Large-Scale Web Crawls

ACM Transactions on the Web

Abstract

References

Cited By

Index Terms

Recommendations

The Representativeness of Automated Web Crawls as a Surrogate for Human Browsing

Ranking web sites using domain ontology concepts

Current challenges in web crawling