Google Formalizes Robots Exclusion Protocol in Effort to Make It an Internet Standard

The Robots Exclusion Protocol (REP) has set the rules for keeping crawlers away from a website since 1994. Google has now submitted a draft to the Internet Engineering Task Force (IETF) to make it an Internet Standard, and has also open sourced its own implementation of the protocol.

The proposed REP draft reflects over 20 years of real-world experience with robots.txt rules, gathered by Googlebot and other major crawlers as well as by the roughly half a billion websites that rely on REP.

According to Google, the draft leaves the basic definition of REP unchanged but defines a number of scenarios that were previously not considered and extends the protocol for the modern web. In particular, the new REP is not limited to HTTP and can be used with other protocols, including FTP and the Constrained Application Protocol. Another new provision effectively limits the maximum size of a robots.txt file by requiring crawler developers to parse at least the first 500 KB of its content. While it does not rule out existing robots.txt files that are larger than that, this requirement aims to reduce load on servers. The new REP also defines how long a robots.txt file may be cached, formalizing a rule Google has been applying for quite some time: a maximum cache lifetime of 24 hours when no cache control directive, e.g. HTTP Cache-Control, is specified. The exception is a robots.txt file that becomes inaccessible due to a server failure, in which case caching can be prolonged to avoid crawling pages that were previously known to be disallowed.
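The size and caching provisions can be illustrated with a minimal Python sketch. This is not Google's code: the names `parse_window`, `cache_lifetime`, and `may_use_cached` are hypothetical, "500 KB" is assumed to mean 500 * 1024 bytes, and the only cache directive considered is `max-age`.

```python
import re
from typing import Optional

# Assumption: "500 KB" is read as 500 * 1024 bytes.
MAX_ROBOTS_BYTES = 500 * 1024
# 24-hour ceiling that applies when no cache control directive is given.
DEFAULT_CACHE_SECONDS = 24 * 60 * 60


def parse_window(body: bytes, limit: int = MAX_ROBOTS_BYTES) -> bytes:
    """Return the leading part of a robots.txt body that must be parsed."""
    return body[:limit]


def cache_lifetime(cache_control: Optional[str]) -> int:
    """Seconds a fetched robots.txt may be reused before it is refetched."""
    if cache_control:
        match = re.search(r"max-age\s*=\s*(\d+)", cache_control, re.IGNORECASE)
        if match:
            return int(match.group(1))
    return DEFAULT_CACHE_SECONDS


def may_use_cached(age_seconds: int,
                   cache_control: Optional[str] = None,
                   server_failure: bool = False) -> bool:
    """Decide whether a cached copy is still usable.

    A server failure prolongs the cache, so previously disallowed
    pages are not crawled by accident.
    """
    if server_failure:
        return True
    return age_seconds <= cache_lifetime(cache_control)
```

A crawler built along these lines would pass only `parse_window(body)` to its rule parser and would refetch robots.txt once `may_use_cached` returns False.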

A number of directives that are in use, including crawl-delay, nofollow, and noindex, have not been included in the draft, and Google will retire all code that handles such rules by September 1, 2019. In particular, webmasters who relied on noindex in robots.txt to keep a page out of Google's index need to look for alternatives. These include noindex robots meta tags in HTML, noindex HTTP response headers, and returning a 404 or 410 HTTP status code. Google has also clarified that while the robots.txt Disallow directive does not guarantee that a page will stay out of Google's index, it aims to make such pages less visible in the future if they do get indexed for some reason.
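As a rough illustration of these alternatives, the following Python sketch is a small WSGI application that marks one page as noindex through both the robots meta tag and the `X-Robots-Tag` response header, and answers 410 for a removed page. The paths and the `app` name are made up for the example, and a real site would normally configure this in its web server or framework instead.

```python
# Hypothetical paths, chosen only for illustration.
NOINDEX_PATHS = {"/drafts/post.html"}
REMOVED_PATHS = {"/old/page.html"}

PAGE = (
    "<!doctype html><html><head>"
    "{meta}<title>Example</title></head>"
    "<body>Example page</body></html>"
)


def app(environ, start_response):
    """Minimal WSGI app showing noindex alternatives to robots.txt."""
    path = environ.get("PATH_INFO", "/")

    # Alternative 1: tell crawlers the page is permanently gone.
    if path in REMOVED_PATHS:
        start_response("410 Gone",
                       [("Content-Type", "text/plain; charset=utf-8")])
        return [b"Gone"]

    headers = [("Content-Type", "text/html; charset=utf-8")]
    meta = ""
    if path in NOINDEX_PATHS:
        # Alternative 2: noindex as an HTTP response header.
        headers.append(("X-Robots-Tag", "noindex"))
        # Alternative 3: noindex as a robots meta tag in the HTML head.
        meta = '<meta name="robots" content="noindex">'

    start_response("200 OK", headers)
    return [PAGE.format(meta=meta).encode("utf-8")]
```

For local experiments the app can be served with the standard library, for example `wsgiref.simple_server.make_server("", 8000, app).serve_forever()`. Either the header or the meta tag is enough on its own; the sketch sets both only to show the two forms side by side.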

As mentioned, Google has also open sourced the C++ library it has been using in its crawler. This library may be considered a reference implementation of the draft protocol and includes a testing tool for robots.txt rules. Google's new REP draft also includes an updated Backus-Naur description of the syntactical rules a robots.txt file must obey. Both the C++ library and the Backus-Naur specification are meant to make it easier for developers to build reliable robots.txt parsers.
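To give an idea of what testing robots.txt rules looks like in practice, here is a short Python sketch. It uses `urllib.robotparser` from the Python standard library, not Google's C++ library, so its matching behavior may differ from the draft in edge cases. The rules, the `ExampleBot` and `OtherBot` user agents, and the `is_allowed` helper are all invented for the example.

```python
from urllib.robotparser import RobotFileParser

# Example rules, invented for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: ExampleBot
Disallow: /
"""

parser = RobotFileParser()
# parse() accepts an iterable of lines, so no network access is needed.
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(user_agent: str, url: str) -> bool:
    """Check whether the given user agent may fetch the URL."""
    return parser.can_fetch(user_agent, url)


for agent, url in [
    ("OtherBot", "https://example.com/public/page.html"),
    ("OtherBot", "https://example.com/private/page.html"),
    ("ExampleBot", "https://example.com/public/page.html"),
]:
    print(agent, url, "allowed" if is_allowed(agent, url) else "disallowed")
```

Running the same rule set through more than one parser is a simple way to spot the kind of divergent interpretations that a formal grammar and a reference implementation are intended to reduce.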

The new REP draft currently has the status of a request for comments (RFC) and is awaiting feedback from interested parties.
