The Robots File

The robots.txt file is a simple text file, served from the root directory of your website, that directs compliant robots to the important parts of your site and asks them to stay out of private areas.  It can be used to limit access to personal, private, sensitive, or unnecessary directories, documents, and related digital materials, but it should never be relied upon to secure them.  The problem is that there are compliant bots run by respectable search engines and directories (e.g. Googlebot, Bingbot, Slurp, Teoma, Robozilla), and then there are scammer bots that trawl websites looking for active email addresses and contact info to exploit with spam, and the latter simply ignore the file.
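
Keep in mind that the robots.txt file itself is public: anyone can load it in a browser. A line like the following (the /private/ path is just a hypothetical example) merely asks polite robots to stay out, while telling any human reader exactly where the directory is:

# A request, not a lock: this does NOT secure the directory
User-agent: *
Disallow: /private/

So treat robots.txt as traffic direction, and use real access controls (passwords, server permissions) for anything genuinely sensitive.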

Why is a robots.txt File Important?

There are at least a half-dozen good reasons to use a robots.txt file to direct search engine crawler traffic on your website...

  1. A robots.txt file can save you bandwidth: when compliant spiders come to visit, they won't crawl areas that hold no useful information for them (your cgi-bin, images, etc).  This reduces the overhead of the job for your server and the search engine alike.
  2. Although a robots.txt file shouldn't be thought of as security, it does give you a very basic level of protection: content you exclude won't show up in search results, so someone would have to visit your site and load the URL in question directly instead of stumbling onto it through Google, Bing, Yahoo, or Teoma.
  3. Housekeeping.  A robots.txt file keeps your hosting/server logs clean. Search engines request the file on every visit, which can happen many times in one day (especially if you are a good site promoter), and when your website doesn't deliver one, a "404 Not Found" error is logged each time. Without the file, it gets harder to wade through all of those missing-robots.txt errors each month to find the genuine errors a webmaster needs to address. A minimal file that allows everything (shown just after this list) is enough to stop the noise.
  4. A robots.txt file can even prevent penalties associated with duplicate content. For instance, let's say you have a high-speed version of your pages with video and animations, a low-speed version with simple images, and a mobile version of your site for phones.  Maybe you even have specific landing pages intended for various advertising campaigns or marketing promotions. If these extra pages duplicate content found elsewhere on your site, you can find yourself penalized by some of the search engines. But you can use the robots.txt file to keep the necessary duplicates from being indexed, and therefore avoid those issues and search engine imposed penalties. Webmasters also use the robots.txt file to exclude "experimental" or "development" areas of a website that are not yet ready for public consumption.
  5. It's also good practice. Professionals have a robots.txt file; amateurs often don't. Which group do you want your site to be seen in? It makes sense to accommodate a search engine's crawl bot if you want a good position in its results.  Compliance with the standard is more a matter of professionalism and courtesy than a "hard" technical reason, but in competitive fields, or when applying for a job (as a webmaster, for example), it can make a big difference: some larger corporations may not consider hiring a webmaster who doesn't know how to set up a simple robots text file to direct traffic, on the assumption that other, more important basics may be missing as well. Many feel it's sloppy and unprofessional not to use one, and most webmasters are looking for any kind of advantage in search optimization.
  6. You can't use Google Webmaster Tools effectively without it. When Google crawls your site, its first task is to fetch your robots.txt file for instructions.  Without one, Google often incorrectly reports an error, so you need a working, validated robots.txt file to keep Google from claiming your site is inaccessible simply because it can't find the file. Since Google Webmaster Tools offers valuable insight into how the world's most popular search engine views your site, it's a good idea to have one.
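
If all you want for now is to welcome every compliant crawler and silence the "404 Not Found" log noise described in reason 3, the smallest useful robots.txt is just two lines; an empty Disallow: value means nothing is off limits:

# Minimal robots.txt: applies to every robot and blocks nothing
User-agent: *
Disallow:

Once this file exists at the root of your site, crawlers requesting it get a "200 OK" instead of filling your logs with errors, and you can add real rules later as the need arises.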

Using the robots.txt File to Direct Traffic

The following is an example robots.txt file:

# robots.txt Sample:

User-agent: googlebot # (Google)
Disallow: /cgi-bin/
Disallow: /php/
Disallow: /js/
Disallow: /scripts/
Disallow: /admin/
Disallow: /images/
Disallow: /*.gif$
Disallow: /*.jpg$
Disallow: /*.jpeg$
Disallow: /*.png$

User-agent: googlebot-image # (Google Image Search)
Disallow: /

User-agent: googlebot-mobile # (Google for Mobile)
Disallow: /cgi-bin/
Disallow: /php/
Disallow: /js/
Disallow: /scripts/
Disallow: /admin/
Disallow: /images/
Disallow: /*.gif$
Disallow: /*.jpg$
Disallow: /*.jpeg$
Disallow: /*.png$

User-agent: Bingbot # (Microsoft)
Disallow: /cgi-bin/
Disallow: /php/
Disallow: /js/
Disallow: /scripts/
Disallow: /admin/
Disallow: /images/
Disallow: /*.gif$
Disallow: /*.jpg$
Disallow: /*.jpeg$
Disallow: /*.png$

User-agent: psbot # (MSN PicSearch)
Disallow: /

User-agent: MSNBot # (Old Microsoft)
Disallow: /cgi-bin/
Disallow: /php/
Disallow: /js/
Disallow: /scripts/
Disallow: /admin/
Disallow: /images/
Disallow: /*.gif$
Disallow: /*.jpg$
Disallow: /*.jpeg$
Disallow: /*.png$

User-agent: Slurp # (Old Yahoo Crawler)
Disallow: /cgi-bin/
Disallow: /php/
Disallow: /js/
Disallow: /scripts/
Disallow: /admin/
Disallow: /images/

User-agent: Yahoo-Blogs/v3.9 # (Old Yahoo Crawler)
Disallow: /cgi-bin/
Disallow: /php/
Disallow: /js/
Disallow: /scripts/
Disallow: /admin/
Disallow: /images/

User-agent: Yahoo-MMCrawler # (Old Yahoo Crawler)
Disallow: /cgi-bin/
Disallow: /php/
Disallow: /js/
Disallow: /scripts/
Disallow: /admin/
Disallow: /images/

User-agent: Teoma # (Ask/Teoma)
Disallow: /cgi-bin/
Disallow: /php/
Disallow: /js/
Disallow: /scripts/
Disallow: /admin/
Disallow: /images/

User-agent: twiceler # (Cuil)
Disallow: /cgi-bin/
Disallow: /php/
Disallow: /js/
Disallow: /scripts/
Disallow: /admin/
Disallow: /images/

User-agent: Scrubby # (Scrub The Web)
Disallow: /cgi-bin/
Disallow: /php/
Disallow: /js/
Disallow: /scripts/
Disallow: /admin/
Disallow: /images/

User-agent: Robozilla # (DMOZ)
Disallow: /cgi-bin/
Disallow: /php/
Disallow: /js/
Disallow: /scripts/
Disallow: /admin/
Disallow: /images/

User-agent: Nutch # (Nutch)
Disallow: /cgi-bin/
Disallow: /php/
Disallow: /js/
Disallow: /scripts/
Disallow: /admin/
Disallow: /images/

User-agent: ia_archiver # (Alexa / Wayback Machine)
Disallow: /cgi-bin/
Disallow: /php/
Disallow: /js/
Disallow: /scripts/
Disallow: /admin/
Disallow: /images/

User-agent: baiduspider # (Baidu)
Disallow: /cgi-bin/
Disallow: /php/
Disallow: /js/
Disallow: /scripts/
Disallow: /admin/
Disallow: /images/

User-agent: naverbot # (Naver)
Disallow: /cgi-bin/
Disallow: /php/
Disallow: /js/
Disallow: /scripts/
Disallow: /admin/
Disallow: /images/

User-agent: yeti # (Yeti)
Disallow: /cgi-bin/
Disallow: /php/
Disallow: /js/
Disallow: /scripts/
Disallow: /admin/
Disallow: /images/

User-agent: asterias # (Singing Fish)
Disallow: /cgi-bin/
Disallow: /php/
Disallow: /js/
Disallow: /scripts/
Disallow: /admin/
Disallow: /images/

User-agent: *
Disallow: /cgi-bin/
Disallow: /php/
Disallow: /js/
Disallow: /scripts/
Disallow: /admin/
Disallow: /images/

# /END robots.txt sample

Here is a key to the instructions it gives:

#
Ignore everything that follows on this line (used for comments).

User-agent:
All of the instructions that follow are directed at a specifically named bot/spider/crawler/robot, or at every robot in general (using the * asterisk wildcard).

Disallow:
Do not crawl anything that matches the pattern that follows this instruction. The argument can be a location, such as a directory, or a pattern indicating files with a certain naming scheme or specific extensions that should be avoided.  Argument examples follow:
/cgi-bin/
...keeps robots out of the cgi-bin directory at the root ('/') of the site, and everything inside it.
/*.png$
...disallows the crawling of Portable Network Graphics images (.png files) anywhere on the site, since they usually contain no textual content relevant to a search engine anyway.
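
A word of caution about those patterns: the * wildcard and the $ end-of-URL anchor are extensions to the original robots exclusion standard. Major crawlers such as Googlebot and Bingbot honor them, but smaller or older robots may not, so it's safest to use them only in groups aimed at crawlers known to support them. Two more hypothetical pattern examples:

User-agent: googlebot
# Skip any URL ending in .pdf, anywhere on the site
Disallow: /*.pdf$
# Skip any URL containing a query string (a ? character)
Disallow: /*?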

There is one more directive you may encounter...
Allow:
...which explicitly permits a specific (previously named) user agent to crawl a given path.  Since a crawler will go anywhere it isn't told not to with Disallow:, a bare Allow: line is usually redundant and just increases the size of the robots.txt file that the bot has to scan and interpret.  Where it earns its keep is as an exception to a broader Disallow:, opening up a single file or subdirectory inside an area you've otherwise blocked.  We are mainly using the robots.txt file to avoid extra work, not create it.
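
As a sketch of that exception pattern (the logo file name here is purely hypothetical), the following blocks the entire images directory for Googlebot while still letting it fetch one file inside it:

User-agent: googlebot
Disallow: /images/
# Exception to the rule above: this one file may still be crawled
Allow: /images/logo.png

As with the wildcard patterns, Allow: is an extension that not every robot understands, so don't rely on it for anything critical.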

© MMXIII & MMXIV Douglas Peters (aka Doug Peters or DP), all rights reserved worldwide.