ABSTRACT
A website can regulate search engine crawler access to its content using the robots exclusion protocol, specified in its robots.txt file. The protocol's rules let a site allow or disallow part or all of its content to specific crawlers, potentially biasing access in favor of some crawlers over others. A 2007 survey of robots.txt usage across 7,593 sites found some evidence of such biases, the news of which led to widespread discussions on the web. In this paper, we report on our survey of about 6 million sites. Our survey addresses the shortcomings of the previous survey and shows the lack of any significant preference towards any particular search engine.
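To illustrate the kind of crawler-specific bias the survey looks for, the following sketch uses Python's standard `urllib.robotparser` against a hypothetical robots.txt (the rules, bot names, and paths are illustrative, not taken from the study): one crawler is granted full access while another is barred from part of the site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt biased toward one crawler: Googlebot may
# fetch everything, SomeOtherBot is barred from /private/, and all
# remaining crawlers are barred from /tmp/. (Example rules only.)
robots_txt = """\
User-agent: Googlebot
Disallow:

User-agent: SomeOtherBot
Disallow: /private/

User-agent: *
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# An empty Disallow line means "allow everything" for that crawler.
print(parser.can_fetch("Googlebot", "/private/page.html"))     # True
print(parser.can_fetch("SomeOtherBot", "/private/page.html"))  # False
print(parser.can_fetch("AnyBot", "/tmp/cache.html"))           # False
```

A crawler that obeys the protocol calls `can_fetch` with its own user-agent string before requesting each URL; a large-scale bias study like this one instead compares the answers across many user-agents for the same sites.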
REFERENCES
- M. Dasdan, D. Pavlovski, and A. Bhattacharjee. A large scale study of sitemap usage. Available by email, Jan. 2008.
- M. Koster. A method for web robots control. Internet draft, The Internet Engineering Task Force (IETF), 1996.
- The sitemap protocol. URL: http://www.sitemaps.org, Apr. 2007.
- Y. Sun, Z. Zhuang, I. G. Councill, and C. L. Giles. Determining bias to search engines from robots.txt. In Proc. of Int. Conf. on Web Intelligence (WI), pages 149--155. IEEE/WIC/ACM, 2007.
- Y. Sun, Z. Zhuang, and C. L. Giles. A large scale study of robots.txt. In Proc. of Int. Conf. on World Wide Web (WWW), pages 1123--1124. ACM, 2007.
Index Terms
- A larger scale study of robots.txt