Define Spiders on the Computer

Google's spider, named Googlebot, builds the company's search index.

Although Web spiders are simply scripts running on an Internet-connected computer, their name carries a sinister connotation because of its eight-legged namesake. As a result, people often have misconceptions about spiders and how they operate. In most cases, spiders benefit your website by helping people find you and the information they are searching for. Some spiders are parasitic, however, and steal content. You can block these spiders from accessing your site.

    What Are Spiders?

    • In computing terms, spiders are automated scripts that crawl the Internet and retrieve information. A spider starts with a set of seed addresses to visit and sends standard Web requests to download the pages at those addresses. It parses each page and extracts the target information. New addresses found in links on the downloaded pages are added to its database; in time, those pages are crawled too, and the process continues. This allows the spider to navigate the Web automatically, using the information it is programmed to gather to expand its database.
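The crawl loop described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular search engine works: the names `crawl`, `extract_links`, and `FAKE_WEB` are invented for this example, and the `fetch` argument stands in for a real Web request so the sketch runs without a network connection.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute link targets found in html, without #fragments."""
    parser = LinkExtractor()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, href)).url for href in parser.links]


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl starting from the seed addresses.

    fetch(url) must return the page's HTML as a string, or None if the
    page could not be downloaded.
    """
    frontier = deque(seeds)   # addresses waiting to be visited
    seen = set(seeds)         # every address ever queued
    pages = {}                # the spider's "database": url -> html
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        for link in extract_links(url, html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages


# Hypothetical in-memory "Web" used in place of real requests.
FAKE_WEB = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/">Home</a> <a href="/c#top">C</a>',
    "http://example.com/b": "No links here.",
    "http://example.com/c": "End of the trail.",
}

pages = crawl(["http://example.com/"], FAKE_WEB.get)
```

Starting from a single seed, the spider reaches all four pages because each one is linked from a page it has already downloaded. The `seen` set keeps it from queuing the same address twice.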

    How Do Spiders Work?

    • Spiders typically retrieve large amounts of information as they traverse the Internet, so to avoid exhausting bandwidth and storage space, a spider follows a set of rules that let it crawl intelligently. The author programs these rules into the script to determine how many levels deep into a website the spider will travel and how often it revisits the site to check for updated content. An automated spider can generate many more Web requests in a short period than a human can, and this can adversely affect a website's performance. The script author usually avoids this by staggering requests, so that the site owner has no reason to block the spider.
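The three rules mentioned above (depth limit, revisit interval, and staggered requests) might be expressed as follows. This is a sketch under assumed defaults: the class name `CrawlPolicy` and the values for `max_depth`, `min_delay`, and `revisit_after` are illustrative choices, not figures from any real spider. The clock and sleep functions are passed in so the behavior can be checked without actually waiting.

```python
import time
from urllib.parse import urlparse


class CrawlPolicy:
    """Rules that keep a spider from overloading a site or itself."""

    def __init__(self, max_depth=3, min_delay=2.0, revisit_after=86400.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_depth = max_depth          # link levels to follow from a seed
        self.min_delay = min_delay          # seconds between hits on one host
        self.revisit_after = revisit_after  # seconds before a page is stale
        self.clock = clock
        self.sleep = sleep
        self._last_request = {}             # host -> time of last request

    def should_follow(self, depth):
        """True if a link at this depth is still within the limit."""
        return depth <= self.max_depth

    def due_for_revisit(self, last_crawled_at):
        """True if enough time has passed to check the page again."""
        return self.clock() - last_crawled_at >= self.revisit_after

    def wait_turn(self, url):
        """Pause if the same host was requested too recently.

        Returns the number of seconds spent waiting.
        """
        host = urlparse(url).netloc
        waited = 0.0
        last = self._last_request.get(host)
        if last is not None:
            remaining = self.min_delay - (self.clock() - last)
            if remaining > 0:
                self.sleep(remaining)
                waited = remaining
        self._last_request[host] = self.clock()
        return waited


class FakeClock:
    """Stand-in clock so the example runs instantly."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds
```

A crawler would call `wait_turn(url)` immediately before each request. Tracking the delay per host means a slow-down on one site does not hold up requests to other sites.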

    Why Are Spiders Used?

    • Spiders have many purposes, but they are primarily information gatherers. Search engines rely on spiders to scan the Web and build a searchable index; without spiders, sites like Google or Yahoo would return far fewer results. Price-comparison sites use spiders to find vendors selling selected products, scanning their websites regularly to show the latest prices. Although spiders have many legitimate and beneficial uses, they are also used for malicious purposes, such as harvesting email addresses from websites to sell to email marketers. Other spiders crawl websites looking for exploitable scripts and software with known vulnerabilities in order to launch an attack and steal private data.

    Blocking Spiders

    • You can block search engine spiders from crawling your website by creating a robots.txt file. This is a plain text file stored in the root of your website that lets you issue instructions to compliant crawlers, controlling their behavior when they visit your site. You can target individual spiders or use general instructions that apply to all spiders. The problem with this approach is that compliance is voluntary, and only legitimate bots obey the rules. Malicious spiders simply ignore them, so you need an alternative method to block those. Because spiders are simply scripts, they often run from a static base, so their requests come from the same IP address. If you spot a spider visiting your site in the site's log file, you can see its IP address, which you can then block to stop the spider from accessing your site.
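Both approaches can be illustrated with Python's standard library. The sketch below makes several assumptions: the robots.txt rules and the spider name `BadBot` are invented for the example, the log lines are made up, and the log is assumed to put the client IP address first on each line, as the common Web server log formats do. The first half shows how a compliant spider would interpret the rules; the second tallies requests per IP address so that an unusually busy visitor stands out.

```python
from collections import Counter
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: ban one named spider from the whole site and
# keep every other compliant spider out of /private/.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def load_rules(robots_txt):
    """Parse robots.txt text the way a compliant spider would."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def requests_per_ip(log_lines):
    """Count requests per client IP, assuming the IP is the first field."""
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if fields:
            counts[fields[0]] += 1
    return counts


rules = load_rules(ROBOTS_TXT)
# A compliant spider checks each address before requesting it.
googlebot_home = rules.can_fetch("Googlebot", "http://example.com/index.html")
googlebot_private = rules.can_fetch("Googlebot", "http://example.com/private/a.html")
badbot_home = rules.can_fetch("BadBot", "http://example.com/index.html")

# Made-up log excerpt; the addresses come from ranges reserved for documentation.
SAMPLE_LOG = [
    '203.0.113.7 - - [01/Jan/2024:10:00:00 +0000] "GET / HTTP/1.1" 200 512',
    '203.0.113.7 - - [01/Jan/2024:10:00:01 +0000] "GET /a HTTP/1.1" 200 734',
    '198.51.100.4 - - [01/Jan/2024:10:00:02 +0000] "GET / HTTP/1.1" 200 512',
    '203.0.113.7 - - [01/Jan/2024:10:00:02 +0000] "GET /b HTTP/1.1" 200 128',
]
busiest_ip, hits = requests_per_ip(SAMPLE_LOG).most_common(1)[0]
```

Once a suspicious address has been identified, the block itself is applied in the Web server or firewall configuration, which varies by product and is not shown here.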


  • Photo Credit Justin Sullivan/Getty Images News/Getty Images
