How to Create a robots.txt File for Your Website
The robots.txt file provides guidelines to any Web robots scanning your site. Search engines such as Google and Bing use Web robots to automatically index the Web. By default, robots crawl your entire website. However, most websites include files that aren't meant to be crawled, either because they aren't part of the public-facing site or because they serve administrative purposes. The robots.txt file indicates which directories and files shouldn't be crawled. Keep in mind that malware robots and other programs designed to scan for vulnerable systems will ignore the file, so don't use it as a security measure.
- Difficulty: Easy
Instructions
Things You'll Need
- A basic text editor.
- Assumption: you already have a web site on a hosting server.
1
Open a plain-text editor such as Notepad. Type the following line at the top of the file:
User-agent: *
This applies all the rules that follow to all robots.
2
Add a disallow line for each directory you don't want crawled:
Disallow: /administrator
This disallow line tells the robot not to enter the directory that follows. Give only the path, not your whole URL. For example, to disallow "mysite.com/dontcrawl," you'd add "Disallow: /dontcrawl" to the robots.txt file.
3
Add an additional disallow line for each directory you don't want crawled. Don't put more than one directory per line. You can also disallow a specific file or page by giving its exact path instead of a directory name.
4
Save the file as robots.txt on your computer. The file name must be all lower-case. Upload the file to the root directory of your website using FTP or your Web host's tools.
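If you would rather generate the file with a script, the steps above can be sketched in Python. This is only a minimal sketch under stated assumptions: the `build_robots_txt` helper, the example directory names and the `mysite.com` URLs are illustrative and not part of any standard tool. It uses the `urllib.robotparser` module from Python's standard library to check the rules locally before you upload the file.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(disallowed_paths):
    """Return robots.txt text that applies to all robots (hypothetical helper)."""
    lines = ["User-agent: *"]
    # One Disallow line per directory or file, written as a path, not a full URL.
    lines += [f"Disallow: {path}" for path in disallowed_paths]
    return "\n".join(lines) + "\n"


# Example paths are assumptions for illustration only.
robots_txt = build_robots_txt(["/administrator", "/dontcrawl"])

# Save under the exact lower-case name; this is the file to upload to the site root.
with open("robots.txt", "w", encoding="utf-8") as f:
    f.write(robots_txt)

# Check the rules locally with the standard-library parser.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())
can_crawl_private = parser.can_fetch("*", "http://mysite.com/dontcrawl/page.html")
can_crawl_home = parser.can_fetch("*", "http://mysite.com/index.html")
print(can_crawl_private, can_crawl_home)  # False True
```

The upload itself is left out because it depends on your host; use FTP or your Web host's tools as described in step 4.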
Tips & Warnings
If your hosting provider does not allow you to create or modify your own robots.txt file, ask them to place a custom file for your site on their servers.
Technically, you tell search engines what they can see and index by telling them what not to look at.
If your site has no robots.txt file, search engines assume that everything is OK to index.
Check the robots.txt files on other sites, including search engines, to see what they are blocking.
To tell spiders not to index a whole directory, follow the directory name with a trailing slash, e.g. /directory/. The trailing slash tells the robot that this is a directory.
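The effect of the trailing slash can be checked with Python's standard `urllib.robotparser` module. The sketch below assumes a parser that matches rules by path prefix, which is how this module works; other robots may differ in detail. The `is_allowed` helper and the example paths are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(rule_path, url_path):
    """Check url_path against a single Disallow rule for all robots."""
    parser = RobotFileParser()
    parser.parse(["User-agent: *", f"Disallow: {rule_path}"])
    return parser.can_fetch("*", url_path)


# With the trailing slash, only paths inside the directory are blocked.
print(is_allowed("/directory/", "/directory/page.html"))     # False
print(is_allowed("/directory/", "/directory-listing.html"))  # True

# Without it, any path that merely starts with "/directory" is blocked too.
print(is_allowed("/directory", "/directory-listing.html"))   # False
```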
Robots match paths case-sensitively, so make sure every directory or file named in the robots.txt file has exactly the same case as the name on the server. Windows servers will serve file names of mixed case, and UNIX servers will do the same if they are configured to, but the rule in robots.txt still has to match the URL as written.
Best practice is to name all files in lower-case letters no matter which server platform you are on.
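A short check with Python's standard `urllib.robotparser` module illustrates why case matters. The `/Private/` directory is a made-up example, and the result shows how this particular parser behaves.

```python
from urllib.robotparser import RobotFileParser

case_parser = RobotFileParser()
case_parser.parse(["User-agent: *", "Disallow: /Private/"])

# Same case as the rule: the path is blocked.
same_case = case_parser.can_fetch("*", "/Private/report.html")
# Different case: the rule does not apply, so the path stays crawlable.
other_case = case_parser.can_fetch("*", "/private/report.html")
print(same_case, other_case)  # False True
```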
If your robots.txt file is generated automatically, make sure the server returns plain text. If a request for robots.txt sends back an HTML page, or anything other than the plain-text rules, search engines may not index your site.
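One way to sanity-check what your server sends back is a small heuristic like the one below. It is a rough sketch, not a validator: the `looks_like_robots_txt` helper is hypothetical, and requiring a `text/plain` content type is a deliberately strict assumption (some robots may be more forgiving). Pass it the `Content-Type` header and body you get when you request your own robots.txt.

```python
def looks_like_robots_txt(content_type, body):
    """Heuristic check (hypothetical helper): plain text with at least one directive."""
    # Assumption: a well-behaved server labels the file as text/plain.
    if not content_type.lower().startswith("text/plain"):
        return False
    # An HTML page, such as an error or login page, is not a robots.txt file.
    if body.lstrip().lower().startswith(("<!doctype", "<html")):
        return False
    directives = ("user-agent:", "disallow:", "allow:", "sitemap:")
    return any(
        line.strip().lower().startswith(directives) for line in body.splitlines()
    )


good = looks_like_robots_txt(
    "text/plain; charset=utf-8", "User-agent: *\nDisallow: /administrator\n"
)
bad = looks_like_robots_txt("text/html", "<html><body>Not found</body></html>")
print(good, bad)  # True False
```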
Comments
- mar1965, Mar 16, 2008: Excellent tip! Thanks for sharing this!
- webmiser, Jan 14, 2008: Thank you. Wish more people actually understood this concept.