Search Indexing Robots and Robots.txt

Search engine robots check for a special file, called robots.txt, at the root of each web server. As you may guess from the name, it is a plain text file (not HTML).

Robots.txt implements the Robots Exclusion Protocol, which lets the web site administrator declare which parts of the site are off-limits to robots, identified by user agent name. Web administrators can disallow access to their cgi, private, and temporary directories, for example, because they do not want pages in those areas indexed.
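
To see the protocol from the robot's side, here is a minimal sketch of how a well-behaved crawler might consult robots.txt before fetching a page. It uses Python's standard urllib.robotparser module; the domain and the bot name "MyBot" are placeholders, not real examples from this site.

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://www.yourdomain.com/robots.txt")
    rp.read()  # fetches and parses the file from the server

    # Ask whether a given user agent may fetch a given URL.
    if rp.can_fetch("MyBot", "http://www.yourdomain.com/private/report.html"):
        print("allowed")
    else:
        print("disallowed by robots.txt")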

The syntax of this file is obscure to most of us: it tells robots not to look at pages whose URLs begin with certain paths. Each record names a user agent (robot) and lists the path prefixes it may not follow. There is no way to allow a specific directory, or to match a particular kind of file. You should remember that robots may access any path in a URL which is not explicitly disallowed in this file: everything not forbidden is OK.
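
The matching rule itself is simple prefix comparison. The sketch below illustrates the idea in Python; it is a simplification for clarity, not a complete implementation of the standard.

    def path_is_allowed(path, disallowed_prefixes):
        # A path is blocked if it starts with any disallowed prefix;
        # an empty Disallow value forbids nothing.
        return not any(path.startswith(p) for p in disallowed_prefixes if p)

    rules = ["/cgi-bin/", "/tmp/", "/private/"]
    print(path_is_allowed("/private/notes.html", rules))  # False: prefix match
    print(path_is_allowed("/docs/index.html", rules))     # True: nothing forbids it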

You can usually read this file by just requesting it from the server in a browser (for example, www.yourdomain.com/robots.txt). It appears as a simple text page and is easy to read.
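
If you would rather fetch the file from a script than a browser, Python's standard urllib will do it; the domain is again a placeholder.

    from urllib.request import urlopen

    with urlopen("http://www.yourdomain.com/robots.txt") as response:
        print(response.read().decode("utf-8", errors="replace"))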

This is all documented in the Standard for Robot Exclusion, and all robots should recognize and honor the rules in the robots.txt file.

Here are some example entries and what they mean.
User-agent: *
Disallow:

The asterisk (*) in the User-agent field is shorthand for "all robots". Because nothing is disallowed, everything is allowed.

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/

In this example, all robots can visit every directory except the three mentioned.
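
You can check a record like this one without a web server: urllib.robotparser will also parse rules passed in as a list of lines. A sketch; the bot name is made up.

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.parse([
        "User-agent: *",
        "Disallow: /cgi-bin/",
        "Disallow: /tmp/",
        "Disallow: /private/",
    ])
    print(rp.can_fetch("AnyBot", "/docs/index.html"))   # True
    print(rp.can_fetch("AnyBot", "/tmp/scratch.html"))  # False
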
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/

In this case, the BadBot robot is not allowed to see anything.

The blank line indicates a new "record" - a new set of rules for a different user agent.

All other robots can see everything except the "private" folder.
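
Parsed the same way, this two-record file gives different answers for different agents: BadBot matches its own record and is shut out, while any other name falls through to the * record. The bot names here are placeholders.

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.parse([
        "User-agent: BadBot",
        "Disallow: /",
        "",
        "User-agent: *",
        "Disallow: /private/",
    ])
    print(rp.can_fetch("BadBot", "/index.html"))   # False: shut out entirely
    print(rp.can_fetch("GoodBot", "/index.html"))  # True
    print(rp.can_fetch("GoodBot", "/private/x"))   # False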

User-agent: WeirdBot
Disallow: /tmp/
Disallow: /private/
Disallow: /links/listing.html

User-agent: *
Disallow: /tmp/
Disallow: /private/

This keeps the WeirdBot from visiting the listing page in the links directory, the tmp directory, and the private directory.

All other robots can see everything except the tmp and private directories.
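
Note that a Disallow value need not be a directory: the record above blocks the single file /links/listing.html for WeirdBot while leaving the rest of the links directory reachable, which the parser confirms.

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.parse([
        "User-agent: WeirdBot",
        "Disallow: /tmp/",
        "Disallow: /private/",
        "Disallow: /links/listing.html",
    ])
    print(rp.can_fetch("WeirdBot", "/links/listing.html"))  # False
    print(rp.can_fetch("WeirdBot", "/links/other.html"))    # True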

If you think this is inefficient, you're right.

For more information, see the Standard for Robot Exclusion, the Guidelines for Robot Writers, and the Web Server Administrator's Guide to the Robots Exclusion Protocol.

There are two proposed extensions to the robots.txt standard: Martijn Koster's 1996 RFC draft memo on Web Robots Control and Sean Conner's proposed An Extended Standard for Robot Exclusion (version 2.0).

You can also check a robots.txt file for errors with the BotWatch robots.txt syntax checker or with the WebWatch robots.txt checker from the UK Office for Library and Information Networking.
