DOI: 10.1145/1367497.1367711
poster

A larger scale study of robots.txt

Published: 21 April 2008

ABSTRACT

A website can regulate search engine crawler access to its content using the robots exclusion protocol, specified in its robots.txt file. The rules in the protocol enable the site to allow or disallow part or all of its content to certain crawlers, resulting in a favorable or unfavorable bias towards some of them. A 2007 survey of the robots.txt usage of 7,593 sites found some evidence of such biases, news of which led to widespread discussion on the web. In this paper, we report on our survey of about 6 million sites. Our survey tries to correct the shortcomings of the previous survey and shows no significant preference towards any particular search engine.
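
As an illustration of how per-crawler rules can produce the kind of bias discussed above, the following is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt content, the example.com URLs, the crawler name ExampleBot, and the helper function allowed are all hypothetical, chosen only for illustration; this is not the method used in either survey.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler may fetch everything,
# while all other crawlers are kept out of /private/.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /private/
"""


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    private_url = "https://example.com/private/report.html"
    public_url = "https://example.com/index.html"
    for agent in ("Googlebot", "ExampleBot"):
        print(agent, allowed(ROBOTS_TXT, agent, private_url),
              allowed(ROBOTS_TXT, agent, public_url))
```

In this sketch the site favors one crawler: the named crawler can reach the /private/ path while every other crawler cannot. A survey of bias would, roughly speaking, look for such asymmetries across many sites.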

References

  1. M. Dasdan, D. Pavlovski, and A. Bhattacharjee. A large scale study of sitemap usage. Available by email, Jan 2008.
  2. M. Koster. A method for web robots control. The internet draft, The Internet Engineering Task Force (IETF), 1996.
  3. The sitemap protocol. URL: http://www.sitemaps.org, Apr 2007.
  4. Y. Sun, Z. Zhuang, I. G. Councill, and C. L. Giles. Determining bias to search engines from robots.txt. In Proc. of Int. Conf. on Web Intel. (WI), pages 149--55. IEEE/WIC/ACM, 2007.
  5. Y. Sun, Z. Zhuang, and C. L. Giles. A large scale study of robots.txt. In Proc. of Int. Conf. on World Wide Web (WWW), pages 1123--4. ACM, 2007.

    • Published in
      WWW '08: Proceedings of the 17th international conference on World Wide Web
      April 2008
      1326 pages
      ISBN:9781605580852
      DOI:10.1145/1367497

      Copyright © 2008 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%