Image – © Vladislav Kochelaevs – Fotolia.com

The web contains millions of websites and billions of webpages. Search engines use crawlers and robots to index these webpages. Moreover, cyber criminals use spam bots to collect email addresses.
 
The robots.txt file tells crawlers and robots which pages on your site they may and may not crawl. Well-behaved crawlers follow these instructions, which helps keep sensitive areas of a site out of search results. Compliance is voluntary, however, so the file does not stop spam bots or other crawlers that ignore it, and it is no substitute for access controls. For example, a robots.txt file is useful if you run an e-commerce site: it can instruct robots not to crawl pages that expose client information. A robots.txt file also comes in handy if you would like to restrict the indexing of research materials.


Setting up a Robots.txt File

It is quite easy to set up a robots.txt file. You can use a plain-text editor such as Notepad to create it. The rule of thumb is to name the robot on one line, using a User-agent field. On the next line, list the directory or file that robot should not access, using a Disallow field. If you want to block several directories, put each one on its own Disallow line. The same is true if you would like to block more than one robot: each robot gets its own User-agent line. For example, you can have a robots.txt file that contains the following rules.

User-agent: Googlebot
Disallow: /folder2/

 
This robots.txt file instructs Googlebot not to crawl /folder2/ or anything beneath it. A file may also list several directories, as follows.

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~mike/


In this case, the file asks all robots to stay out of three directories. The * character in the User-agent line is a special value that applies the rules to every robot. It is important to note that pattern matching in paths may or may not work in a robots.txt file: some robots, such as Googlebot, support pattern-matching expressions in Disallow lines, while others treat each path as a plain prefix. When referring to a directory, remember to include the trailing slash ('/').
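
As a rough check of how a parser reads rules like these, the sketch below feeds both example files to Python's standard urllib.robotparser module. The bot name AnyBot, the sample URLs on test.com, and the helper build_parser are illustrative assumptions, and real crawlers may interpret the rules differently, particularly where pattern matching is involved.

```python
from urllib.robotparser import RobotFileParser

# The two example files from above, held as strings.
GOOGLEBOT_ONLY = """\
User-agent: Googlebot
Disallow: /folder2/
"""

ALL_ROBOTS = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~mike/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Hypothetical helper: parse robots.txt text without fetching anything."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


googlebot_rules = build_parser(GOOGLEBOT_ONLY)
all_rules = build_parser(ALL_ROBOTS)

# Only Googlebot is asked to stay out of /folder2/.
print(googlebot_rules.can_fetch("Googlebot", "https://test.com/folder2/page.html"))  # False
print(googlebot_rules.can_fetch("AnyBot", "https://test.com/folder2/page.html"))     # True

# "User-agent: *" applies the rules to every robot.
print(all_rules.can_fetch("AnyBot", "https://test.com/tmp/cache.html"))  # False
print(all_rules.can_fetch("AnyBot", "https://test.com/index.html"))      # True

# The trailing slash matters: /tmp/ does not cover /tmpfiles/.
print(all_rules.can_fetch("AnyBot", "https://test.com/tmpfiles/a.html"))  # True
```

The parser here only reports what the file requests. Whether a given robot honors the request is up to that robot.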

After creating your robots.txt file, save it in the top-level (root) directory of your site. For example, if your site is test.com, place the file at test.com/robots.txt. Crawlers will not look for a file placed at test.com/mysite/robots.txt. In addition, you can specify further directives in your robots.txt file. These include sitemap, crawl-delay, request-rate, visit-time, and allow, although support for each one varies from crawler to crawler.
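
The following sketch illustrates both points with Python's standard library, assuming Python 3.8 or later for site_maps(). The helper robots_url, the bot name AnyBot, the /tmp/public/ directory, and all directive values are made up for illustration. The visit-time directive is left out because urllib.robotparser does not read it, and the Allow line is placed before the Disallow line because this particular parser applies the first rule that matches, whereas some crawlers pick the most specific rule instead.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser


def robots_url(page_url: str) -> str:
    """Hypothetical helper: the root-level robots.txt URL for any page on a site."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Illustrative file using some of the additional directives.
EXTENDED = """\
User-agent: *
Allow: /tmp/public/
Disallow: /tmp/
Crawl-delay: 10
Request-rate: 1/5
Sitemap: https://test.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(EXTENDED.splitlines())

# The file is always looked up at the root, whatever page is being visited.
print(robots_url("https://test.com/mysite/page.html"))  # https://test.com/robots.txt

# Allow carves an exception out of a blocked directory.
print(parser.can_fetch("AnyBot", "https://test.com/tmp/public/a.html"))  # True
print(parser.can_fetch("AnyBot", "https://test.com/tmp/private.html"))   # False

# Pacing hints and the sitemap location, as this parser reports them.
print(parser.crawl_delay("AnyBot"))  # 10
rate = parser.request_rate("AnyBot")
print(rate.requests, rate.seconds)   # 1 5
print(parser.site_maps())            # ['https://test.com/sitemap.xml']
```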

The robots.txt file plays an important role in the way search engines index pages from your website. With this file, you can restrict crawler access to directories, files, images, pages, or even the entire site.