The Robots Exclusion Protocol
Robots, in the Internet context, are software programs that scan websites to collect information. They are not viruses -- no code is placed on your machine, and when a robot is finished with your website, it typically leaves no trace beyond entries in your web server's access log. The information collection process is not necessarily harmful -- you might benefit from the visit. The Robots Exclusion Protocol (REP) gives you some control over the process.
-
History
-
The REP idea started in 1994 on a robots mailing list (robots-request@nexor.co.uk) as a way to guide robots through websites. The basic idea was to install a short file with a known name and location that tells the robot where it may look. These directions would probably be ignored by malevolent robots, but benign robots could use them to save time by examining only some of your files. The basic protocol was enhanced in 2008 by several major Internet companies, including Yahoo and Google.
Benign Robots
-
There are some robots you actually want to visit your website. For example, search engines use robots to index the Internet. Starting with a single website address, the robot classifies that website and keeps a list of all the links found on it. Then the robot works down the list of collected addresses, visiting each one in turn. Because lists of newly created websites are publicly available, there is always a backlog of websites to check, which keeps the robots working day and night. You want these visits because you want the search engines to know and classify your website so potential customers can find you through them.
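The visiting order described above can be sketched in a few lines of Python. This is only an illustration, not how any particular search engine works: the LINKS table and the .example site names are made up, and a real robot would download pages instead of reading a dictionary.

```python
from collections import deque

# Hypothetical link graph standing in for real web pages (assumption).
LINKS = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example", "d.example"],
    "c.example": ["a.example"],
    "d.example": [],
}


def crawl(start, links):
    """Visit sites in the order they were discovered, each one only once."""
    to_visit = deque([start])   # the robot's list of collected addresses
    seen = {start}              # addresses already on the list
    visited = []
    while to_visit:
        site = to_visit.popleft()
        visited.append(site)    # a real robot would classify the site here
        for link in links.get(site, []):
            if link not in seen:
                seen.add(link)
                to_visit.append(link)
    return visited


order = crawl("a.example", LINKS)
print(order)
```

The "seen" set is what stops the robot from looping forever when two sites link to each other.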
-
Malevolent Robots
-
Robots can also be used for destructive purposes. For example, robots can compile lists of e-mail addresses indexed by interest. To do this, they look for anything that contains an "@" symbol and take the surrounding string that is bounded by spaces. This is why you will see some computer science professors give their address as Professor.Abc {at sign} University.edu -- it is to foil evil robots. To classify your e-mail address according to interest, the robot looks at the META tags in the HTML code behind the page, where many websites list keywords and a description.
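To see why the {at sign} trick works, here is a minimal Python sketch of the matching rule described above, assuming the robot does nothing smarter than splitting text on spaces. The harvest function and the sample sentences are hypothetical; real harvesting robots may use more elaborate patterns.

```python
def harvest(text):
    """Return the space-delimited strings that contain an '@' symbol."""
    return [token for token in text.split() if "@" in token]


plain = "Contact Professor.Abc@University.edu for details"
obfuscated = "Contact Professor.Abc {at sign} University.edu for details"

found_plain = harvest(plain)            # the plain address is picked up
found_obfuscated = harvest(obfuscated)  # nothing contains '@', so nothing is found
print(found_plain, found_obfuscated)
```

A human reader can still reconstruct the address, but this simple rule finds nothing to collect.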
REP Syntax
-
The robots.txt file is installed in the top-level (root) directory of your website. If your website is www.widgits.com, the pathname to the robots.txt file will be www.widgits.com/robots.txt. In the simplest case, the first line in the file will be "User-agent:" and the next line will be "Disallow:" -- the first line selects the population of robots and the second line shows which directories are off limits. Using ";" to indicate a line break, "User-agent: * ; Disallow: /abc/" are the two lines that direct all robots to avoid the abc directory. To allow SearchBot to examine everything, but forbid all other robots, the code would be "User-agent: SearchBot ; Disallow: ; User-agent: * ; Disallow: /" -- * means all robots, / means all directories and an empty value after "Disallow:" means no directories are off limits. In the file itself, the two groups of rules are conventionally separated by a blank line.
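The two rule sets above can be checked with the robots.txt parser in Python's standard library (urllib.robotparser). This is a sketch of how one well-behaved robot might interpret the rules, not a guarantee of how every robot behaves. "OtherBot", the build_parser helper and the page paths are made-up names for illustration; SearchBot and www.widgits.com come from the example above.

```python
from urllib.robotparser import RobotFileParser

# Rule set 1: all robots must avoid the abc directory.
BLOCK_ABC = """\
User-agent: *
Disallow: /abc/
"""

# Rule set 2: SearchBot may examine everything, all other robots nothing.
SEARCHBOT_ONLY = """\
User-agent: SearchBot
Disallow:

User-agent: *
Disallow: /
"""


def build_parser(rules_text):
    """Parse robots.txt text directly, without fetching it over the network."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser


site = "http://www.widgits.com"

abc_rules = build_parser(BLOCK_ABC)
abc_page_ok = abc_rules.can_fetch("OtherBot", site + "/abc/page.html")
index_page_ok = abc_rules.can_fetch("OtherBot", site + "/index.html")

bot_rules = build_parser(SEARCHBOT_ONLY)
searchbot_ok = bot_rules.can_fetch("SearchBot", site + "/abc/page.html")
otherbot_ok = bot_rules.can_fetch("OtherBot", site + "/abc/page.html")

print(abc_page_ok, index_page_ok, searchbot_ok, otherbot_ok)
```

A robot that honors the protocol would run a check like can_fetch before requesting each page; a malevolent robot can simply skip it.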
-