Excluding Pages From Search Engines

Do you really want all your pages indexed? Once a search engine knows your site exists, its robot will automatically crawl your other pages and, in time, add them to its index. Some engines only search two levels deep, whereas others will index your whole site. This invasiveness can cause problems if you do not want certain pages indexed. For example, do you want your order forms, customized error pages, confirmation pages and so on listed on the search engines? Probably not. So here's what you can do to prevent it.

There are two techniques. One is to include a special META tag in each page you don't want indexed. The other is to create a robot text file. We will deal with the robot text file first.

Robot Text Files

The first thing a robot does when it visits your site is look for a file called robots.txt. If the file exists, the robot will follow the instructions contained within it. If there is no robots.txt file, you are giving the robot free rein to index any page it wishes. By including a robots.txt file you can indicate exactly what is, and what is not, off-limits to all robots or just to some of them. Use Notepad or whatever text editor you prefer and set the file out like this:

    # robots.txt file for http://www.yoursite.com

    User-agent: webcrawler
    Disallow:

    User-agent: altavista
    Disallow: /

    User-agent: *
    Disallow: /forms
    Disallow: /logs

Any line starting with '#' is a comment. Use it for your own information. In this instance the first paragraph after the comment applies to the robot called 'webcrawler' and states that webcrawler has nothing disallowed, so it is free to go anywhere. The second paragraph indicates that the robot called 'altavista' is barred from your entire site. The last paragraph indicates that all other visiting robots should not visit URLs starting with /forms or /logs. The '*' is not a wildcard but a special value that stands for any robot not named elsewhere in the file.
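To see how a robot might interpret these rules, here is a minimal sketch using the urllib.robotparser module from Python's standard library. The helper names (build_parser, is_allowed), the robot name 'somebot' and the page paths are made up for illustration, and a real search engine's robot may read the file differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# The example robots.txt from the text, with a blank line between paragraphs.
ROBOTS_TXT = """\
# robots.txt file for http://www.yoursite.com

User-agent: webcrawler
Disallow:

User-agent: altavista
Disallow: /

User-agent: *
Disallow: /forms
Disallow: /logs
"""


def build_parser(text):
    """Parse robots.txt content held in a string (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def is_allowed(parser, robot_name, path):
    """Return True if the named robot may fetch the given path."""
    return parser.can_fetch(robot_name, path)


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    # Illustrative checks: the robot names come from the example file,
    # 'somebot' and the paths are invented.
    checks = [
        ("webcrawler", "/forms/order.html"),
        ("altavista", "/index.html"),
        ("somebot", "/forms/order.html"),
        ("somebot", "/index.html"),
    ]
    for robot, path in checks:
        verdict = "allowed" if is_allowed(parser, robot, path) else "disallowed"
        print(f"{robot:<12} {path:<20} {verdict}")
```

Running the script prints one verdict per robot and path, which is a quick way to confirm that a file says what you intended before you upload it.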
You cannot use wildcard patterns or other expressions in the User-agent or Disallow fields. You also cannot string several paths together on one line like this:

    User-agent: *
    Disallow: /forms /logs /errors /tmp

You must create a new Disallow line for each entry, like this:

    User-agent: *
    Disallow: /forms
    Disallow: /logs
    Disallow: /errors
    Disallow: /tmp

Once you are happy, save the file as "robots.txt" (no quotes) and move it to the root directory of your site, i.e. where your default page resides. Just follow these simple rules and you should have no problems.

META Exclusion Tags

If you are unable to create a robots.txt file because, for example, you share or don't administer the server your files are on, then you can use the following META tag:

    <META NAME="ROBOTS" CONTENT="NOINDEX">

Including this line between the <HEAD> tags of your HTML means that the page will not be indexed. If you instead use:

    <META NAME="ROBOTS" CONTENT="NOFOLLOW">

then the page will be indexed, but any links in that document will not be followed by the robot.
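If you want to check which ROBOTS META tag a page actually carries, a small script can read it back. The following is a minimal sketch using Python's standard html.parser module. The class and function names (RobotsMetaParser, robots_directives, may_index, may_follow) are invented for this example, and splitting the CONTENT value on commas assumes that several values may be listed in one tag, which goes beyond the two single-value tags shown above.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the values of every <META NAME="ROBOTS"> tag in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() != "robots":
            return
        # Assumption: CONTENT may hold several comma-separated values.
        for item in attributes.get("content", "").split(","):
            item = item.strip().lower()
            if item:
                self.directives.add(item)


def robots_directives(html):
    """Return the set of robots directives found in the HTML text."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


def may_index(html):
    """True unless the page carries NOINDEX."""
    return "noindex" not in robots_directives(html)


def may_follow(html):
    """True unless the page carries NOFOLLOW."""
    return "nofollow" not in robots_directives(html)


if __name__ == "__main__":
    page = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX"></head></html>'
    print("index this page:", may_index(page))
    print("follow its links:", may_follow(page))
```

This only reports what the tag says. Whether a visiting robot honours it is up to that robot.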