ABSTRACT
A website can regulate search engine crawler access to its content using the robots exclusion protocol, specified in its robots.txt file. The protocol's rules let a site allow or disallow part or all of its content to specific crawlers, potentially biasing access in favor of some crawlers over others. A 2007 survey of robots.txt usage across 7,593 sites found some evidence of such biases, the news of which led to widespread discussions on the web. In this paper, we report on our survey of about 6 million sites. Our survey addresses the shortcomings of the previous survey and shows the lack of any significant preference towards any particular search engine.
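To illustrate the kind of crawler-specific bias the survey looks for, the following sketch uses Python's standard `urllib.robotparser` against a hypothetical robots.txt (the rules, bot names, and paths are illustrative, not taken from the study): one crawler is granted full access while another is barred from part of the site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt biased toward one crawler: Googlebot may
# fetch everything, SomeOtherBot is barred from /private/, and all
# remaining crawlers are barred from /tmp/. (Example rules only.)
robots_txt = """\
User-agent: Googlebot
Disallow:

User-agent: SomeOtherBot
Disallow: /private/

User-agent: *
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# An empty Disallow line means "allow everything" for that crawler.
print(parser.can_fetch("Googlebot", "/private/page.html"))     # True
print(parser.can_fetch("SomeOtherBot", "/private/page.html"))  # False
print(parser.can_fetch("AnyBot", "/tmp/cache.html"))           # False
```

A crawler that obeys the protocol calls `can_fetch` with its own user-agent string before requesting each URL; a large-scale bias study like this one instead compares the answers across many user-agents for the same sites.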
REFERENCES
- M. Dasdan, D. Pavlovski, and A. Bhattacharjee. A large scale study of sitemap usage. Available by email, Jan. 2008.
- M. Koster. A method for web robots control. Internet draft, The Internet Engineering Task Force (IETF), 1996.
- The sitemap protocol. URL: http://www.sitemaps.org, Apr. 2007.
- Y. Sun, Z. Zhuang, I. G. Councill, and C. L. Giles. Determining bias to search engines from robots.txt. In Proc. of Int. Conf. on Web Intelligence (WI), pages 149--155. IEEE/WIC/ACM, 2007.
- Y. Sun, Z. Zhuang, and C. L. Giles. A large scale study of robots.txt. In Proc. of Int. Conf. on World Wide Web (WWW), pages 1123--1124. ACM, 2007.
Index Terms
- A larger scale study of robots.txt