
The robots.txt file

The robots.txt file is very important, yet surprisingly many web companies don't offer it as part of their service.

 

What is a robots.txt file?

A robots.txt file is a special file on your site that search engine robots and spiders look for to find out which pages or files of the website they may index and which they should ignore.

The robots.txt file is a simple text file (no HTML) that must be placed in your root directory, for example: http://www.yourwebsite.com/robots.txt

 

How easy is it to create a robots.txt file?

The robots.txt file is a plain text file that is simple to create: open a text editor and add 'records', which contain the information for the search engines.

Each record consists of two fields - a User-agent line and one or more Disallow lines, for example -

User-agent: googlebot
Disallow: /cgi-bin/

This robots.txt file would allow 'googlebot' - Google's search engine spider, to index every page of your site except for the files in the 'cgi-bin' directory. All of the files in the cgi-bin will be ignored by googlebot.
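If you want to check a record like this before you upload it, one option is Python's standard urllib.robotparser module. The sketch below is only an illustration: the page URLs and the second spider name "otherbot" are made up, and real spiders may treat edge cases differently from Python's parser.

```python
from urllib.robotparser import RobotFileParser

# The record from the example above, one directive per line.
ROBOTS_TXT = """\
User-agent: googlebot
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Hypothetical URLs on the example site.
cgi_ok = parser.can_fetch("googlebot", "http://www.yourwebsite.com/cgi-bin/form.cgi")
home_ok = parser.can_fetch("googlebot", "http://www.yourwebsite.com/index.html")
# "otherbot" is a made-up spider name; no record mentions it, so nothing is blocked for it.
other_ok = parser.can_fetch("otherbot", "http://www.yourwebsite.com/cgi-bin/form.cgi")

print(cgi_ok, home_ok, other_ok)
```

Here can_fetch returns False only for googlebot requesting a file inside cgi-bin.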

The 'Disallow' value is matched as a prefix, so it works like a wildcard at the end of the path. For example -

User-agent: googlebot
Disallow: /support

If the above is entered into the robots.txt file, both "/support-desk/index.html" and "/support/index.html", as well as all other files in the "support" directory, would be ignored by googlebot.

Every User-agent record must contain at least one Disallow line. However, if the value after 'Disallow:' is left empty, the search engine will index all files within the website.
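The prefix behaviour can be sketched in a few lines of Python. The helper name is_disallowed and the third test path are made up for this illustration, and the function only mimics the basic prefix rule, not any particular spider's full matching logic.

```python
def is_disallowed(path: str, disallow_prefixes: list[str]) -> bool:
    """Return True if the path starts with any non-empty Disallow value."""
    return any(prefix and path.startswith(prefix) for prefix in disallow_prefixes)

paths = ["/support/index.html", "/support-desk/index.html", "/products/support.html"]

# "Disallow: /support" matches every path that begins with "/support".
for path in paths:
    print(path, is_disallowed(path, ["/support"]))

# With a trailing slash, only files inside the "support" directory match.
for path in paths:
    print(path, is_disallowed(path, ["/support/"]))
```

If you only want to block the directory itself and not "/support-desk/", the trailing slash makes the difference.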

 

How do I give all search engine spiders the same rights?

To give all search engine spiders the same rights, use the following robots.txt content - 

User-agent: *
Disallow: /cgi-bin/
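As a quick check, Python's urllib.robotparser module can show that the "*" record applies to every spider. The spider names and URLs below are arbitrary examples chosen for this sketch.

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse(["User-agent: *", "Disallow: /cgi-bin/"])

# Arbitrary spider names; the "*" record is used for each of them.
results = {
    agent: parser.can_fetch(agent, "http://www.yourwebsite.com/cgi-bin/form.cgi")
    for agent in ["googlebot", "examplebot", "anotherbot"]
}
print(results)
```

Every entry in results is False, because each spider falls under the same wildcard record.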

Where can I find user agent names?

The user agent names can be found in your log files by checking for requests to robots.txt.
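A rough way to do this with Python is sketched below. It assumes your server writes its log in the common "combined" format, where the user agent is the last quoted field. The sample lines are made up for illustration, and your own log format may need a different pattern.

```python
import re

# Combined log format: ... "METHOD path PROTOCOL" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

def robots_txt_agents(log_lines):
    """Return the set of user-agent strings that requested /robots.txt."""
    agents = set()
    for line in log_lines:
        match = LOG_LINE.search(line)
        if match and match.group("path").split("?")[0] == "/robots.txt":
            agents.add(match.group("agent"))
    return agents

# Made-up sample lines for illustration only.
sample = [
    '192.0.2.1 - - [01/Jan/2024:00:00:01 +0000] "GET /robots.txt HTTP/1.1" 200 64 "-" "ExampleBot/1.0"',
    '192.0.2.2 - - [01/Jan/2024:00:00:02 +0000] "GET /index.html HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]
print(robots_txt_agents(sample))
```

To run it on a real log, pass an open file object instead of the sample list.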

Usually all search engine spiders should be given the same rights and therefore the above 'User-agent: *' can be used.

 

What should I avoid?

Be careful to format your robots.txt file correctly, or some or all of the files on your website might not get indexed by the search engine spiders.

 

 

 

 

Is there an easy way to avoid errors with a robots.txt file?

  1. Don't use comments in the robots.txt file. Comments are allowed in a robots.txt file, but they can confuse some search engine spiders.

    "Disallow: /support # Don't index the support directory" might be misinterpreted as "Disallow: /support#Don't index the support directory".
     
  2. Don't use white space at the beginning of a line. For example, don't write

        User-agent: *
        Disallow: /support

    but

    User-agent: *
    Disallow: /support
     
  3. Don't change the order of the commands. Each record must start with its User-agent line, followed by its Disallow lines. Don't write

    Disallow: /support
    User-agent: *

    but

    User-agent: *
    Disallow: /support  
     
  4. Don't use more than one directory in a Disallow line. Do not use the following

    User-agent: *
    Disallow: /support /cgi-bin/ /images/

    Search engine spiders cannot understand that format. The correct syntax for this is

    User-agent: *
    Disallow: /support
    Disallow: /cgi-bin/
    Disallow: /images/
     
  5. The right case is important, as the file names on your server are case sensitive. If the name of your directory is "Support", don't write "support" in the robots.txt file.
     
  6. Don't list all files. If you want a search engine spider to ignore all files in a particular directory, you don't have to list each file. For example:

    User-agent: *
    Disallow: /support/orders.html
    Disallow: /support/technical.html
    Disallow: /support/helpdesk.html
    Disallow: /support/index.html

    You can replace this with

    User-agent: *
    Disallow: /support
     
  7. The original robots.txt standard has no "Allow" command. Some search engines accept "Allow" as an extension, but not every spider understands it, so the safest approach is to leave it out. Only mention files and directories that you don't want to be indexed. All other files will be indexed automatically if they are linked on your site.
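Several of the points above can be checked automatically. The following is a minimal sketch of such a check in Python, not a complete validator: the function name lint_robots_txt and its messages are invented for this example, it covers only points 1 to 4 and 7, and it treats a blank line as the end of a record.

```python
def lint_robots_txt(text):
    """Return (line_number, message) pairs for some of the mistakes listed above."""
    problems = []
    seen_user_agent = False
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            seen_user_agent = False  # a blank line ends the current record
            continue
        if line != line.lstrip():
            problems.append((number, "white space at the beginning of the line"))
        if "#" in line:
            problems.append((number, "comment in the file"))
        # Ignore any comment, then split "Field: value" at the first colon.
        field, _, value = line.split("#")[0].partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            seen_user_agent = True
        elif field == "disallow":
            if not seen_user_agent:
                problems.append((number, "Disallow before User-agent"))
            if len(value.split()) > 1:
                problems.append((number, "more than one path in a Disallow line"))
        elif field == "allow":
            problems.append((number, "Allow line"))
    return problems

# A deliberately broken file and a clean one, both written for this example.
bad = " User-agent: *\nDisallow: /support /cgi-bin/ # note\n\nDisallow: /images/\nAllow: /public/\n"
good = "User-agent: *\nDisallow: /support\nDisallow: /cgi-bin/\n"

for number, message in lint_robots_txt(bad):
    print(number, message)
print(lint_robots_txt(good))
```

Points 5 and 6 are not covered, because checking them needs knowledge of the actual files on your server.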

Tips and tricks for creating a robots.txt file

To allow all search engine spiders to index all files
To allow all search engine spiders to index all files of your Web site, use the following content for your robots.txt file:

User-agent: *
Disallow:

To stop all spiders from indexing any file
If you don't want search engines to index any file of your Web site, use the following:

User-agent: *
Disallow: /  
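The difference between an empty Disallow value and "Disallow: /" can be confirmed with Python's urllib.robotparser module. This is a small sketch: the helper name can_fetch, the spider name "examplebot" and the page URL are made up for the example.

```python
from urllib.robotparser import RobotFileParser

def can_fetch(robots_lines, url, agent="examplebot"):
    """Parse the given robots.txt lines and ask whether the agent may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(agent, url)

url = "http://www.yourwebsite.com/any/page.html"  # hypothetical page

allow_all = can_fetch(["User-agent: *", "Disallow:"], url)
block_all = can_fetch(["User-agent: *", "Disallow: /"], url)
print(allow_all, block_all)
```

A single "/" is the only difference between the two files, so it is worth double-checking which one you upload.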

More complex examples of robots.txt files

If you want to see more complex examples of robots.txt files, view the robots.txt files of big Web sites:

o http://www.cnn.com/robots.txt

o http://www.nytimes.com/robots.txt

o http://www.spiegel.com/robots.txt

o http://www.ebay.com/robots.txt

 

Your Web site should have a proper robots.txt file if you want to have good rankings on search engines.

Search engines can only give you a good ranking if they know what to do with your pages.
