The Robots Exclusion Protocol

Robots, in the Internet context, are software programs that scan a website to collect information. These are not viruses -- no code is placed on your machine, and when the robot is finished with your website, there is no evidence on the site itself that the robot was there. The information collection process is not necessarily harmful -- you might benefit from the visit. The Robots Exclusion Protocol (REP) gives you some control over the process.

  1. History

    • The REP idea started in 1994 on the robots mailing list (robots-request@nexor.co.uk) as a way to guide robots through websites. The basic idea was to install a short file with a known name and location that tells the robot where it may and may not look. These directions would probably be ignored by malevolent robots, but could be used by benign robots to save time by examining only some of your files. The basic protocol was enhanced in 2008 by several major Internet companies, including Yahoo and Google.

    Benign Robots

    • There are some robots you actually want to visit your website. For example, search engines use robots to index the Internet. Starting with a single website address, the robot classifies that website and keeps a list of all the links found on it. Then the robot works down the list of collected website addresses, repeating the process at each one. Because the list of new websites created each month is publicly available, there is a backlog of websites to check that keeps the robots working day and night. You want these robot visits because you want the search engines to know and classify your website so potential customers can find you through search engines.
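The visit-and-collect loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the LINKS table and the crawl_order function are hypothetical stand-ins, and the small in-memory table replaces the live web so the sketch runs without any network access. A real search engine robot is far more elaborate.

```python
from collections import deque

# Hypothetical link table standing in for the live web: each address maps
# to the links a robot would find on that site.
LINKS = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example", "d.example"],
    "c.example": ["a.example"],
    "d.example": [],
}

def crawl_order(start, links):
    """Return addresses in the order a simple robot would visit them."""
    to_visit = deque([start])   # the robot's list of collected addresses
    seen = {start}              # addresses already on the list, to avoid repeats
    visited = []
    while to_visit:
        site = to_visit.popleft()
        visited.append(site)    # a real robot would classify the site here
        for link in links.get(site, []):
            if link not in seen:
                seen.add(link)
                to_visit.append(link)
    return visited

print(crawl_order("a.example", LINKS))
# prints ['a.example', 'b.example', 'c.example', 'd.example']
```

The seen set is what keeps the robot from circling forever when two sites link to each other.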

    Malevolent Robots

    • Robots can also be used for destructive purposes. For example, robots can compile a list of e-mail addresses indexed by interests. To do this, they look for anything that has an "@" symbol and take the string around it that is bounded by spaces. This is why you will see some computer science professors give their address as Professor.Abc {at sign} University.edu -- it is to foil evil robots. To classify your e-mail address according to interest, the robot looks at the META tags that are part of the HTML code behind many web pages.

    REP Syntax

    • The robots.txt file is installed in the root directory of your website. If your website is www.widgits.com, the pathname to the robots.txt file will be www.widgits.com/robots.txt. The first line of a rule will be "User-agent:" and the next line will be "Disallow:" -- the first line selects the population of robots and the second line shows which directories are off limits. Using ";" to indicate a line break, "User-agent: * ; Disallow: /abc/" is the two-line statement that directs all robots to avoid the abc directory. To allow SearchBot to examine everything, but forbid all other robots, the code would be "User-agent: SearchBot ; Disallow: ; User-agent: * ; Disallow: /" -- * means all robots, / means all directories and an empty value after "Disallow:" means no directories.
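One way to check how a well-behaved robot would read the two rule sets above is Python's standard urllib.robotparser module. The sketch below is a minimal illustration, not a definitive test of any particular robot: the parse_rules helper and the robot name AnyBot are made up for the example, and the rules are parsed from strings rather than fetched from the hypothetical www.widgits.com site.

```python
from urllib.robotparser import RobotFileParser

def parse_rules(text):
    """Parse robots.txt content held in a string (no network access)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

# All robots must stay out of the abc directory.
AVOID_ABC = """\
User-agent: *
Disallow: /abc/
"""

# SearchBot may examine everything; all other robots are shut out.
SEARCHBOT_ONLY = """\
User-agent: SearchBot
Disallow:

User-agent: *
Disallow: /
"""

SITE = "http://www.widgits.com"

avoid_abc = parse_rules(AVOID_ABC)
print(avoid_abc.can_fetch("AnyBot", SITE + "/abc/page.html"))      # False
print(avoid_abc.can_fetch("AnyBot", SITE + "/index.html"))         # True

searchbot_only = parse_rules(SEARCHBOT_ONLY)
print(searchbot_only.can_fetch("SearchBot", SITE + "/abc/page.html"))  # True
print(searchbot_only.can_fetch("AnyBot", SITE + "/abc/page.html"))     # False
```

In the actual file each statement sits on its own line, and a blank line separates the SearchBot rule from the rule for all other robots. As noted in the History section, these are requests only: a malevolent robot is free to ignore them.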
