What Is a Bot Spider?
A bot spider is an automated computer program -- bot is shorthand for "robot" -- that accesses publicly available pages on the Web, retrieves their content and catalogs it. Bot spiders also follow the hyperlinks on a page -- words, phrases and pictures on which users can click to navigate from page to page -- and catalog the content they find at each destination.
How Bot Spiders Work
A bot spider typically starts with a single, well-known Web address, otherwise known as a Uniform Resource Locator (URL). The bot spider downloads the content from the Web page associated with that address and copies it into a database. Any external links on the page are added to a list, known as the URL frontier, which the bot spider then works through, downloading and copying the content of the destination or landing page for each link. Of course, most Web pages contain links, so bot spiders can start searching, or "crawling," almost anywhere on the Web.
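The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular search engine does it: the names `crawl`, `LinkExtractor` and `FAKE_WEB` are made up for this example, and the `fetch` function is passed in so the sketch can run against a small in-memory "Web" instead of the real network.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, fetch, max_pages=100):
    """Breadth-first crawl starting from seed_url.

    fetch(url) is an assumed callable that returns the page's HTML as a
    string, or None if the page cannot be retrieved.
    """
    frontier = deque([seed_url])   # the URL frontier: pages still to visit
    seen = {seed_url}              # every URL ever added to the frontier
    database = {}                  # url -> downloaded content

    while frontier and len(database) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        database[url] = html       # copy the content into the "database"

        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)   # resolve relative links
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return database


# A hypothetical three-page Web used in place of real HTTP requests.
FAKE_WEB = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a>',
    "http://example.com/b": '<a href="/">home</a>',
}

pages = crawl("http://example.com/", FAKE_WEB.get)
```

The `seen` set is what stops the spider from looping forever when pages link back to one another, as the third page does here.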
Search Engines
Search engines, such as Google, Yahoo! and many others, use a cluster of bot spiders, operating in parallel, to create a snapshot of the Web on a regular basis. The aim is to create a local catalog, or index, of Web pages that the search engine can search for the most applicable results when a user types in a query. A set of behavioral policies, defined by the creator of the bot spiders, determines which Web pages are visited and how often. Whatever those policies are, a search engine must keep its catalog up to date if it is to retain its reliability and credibility.
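One way to picture a "how often" policy is a revisit schedule that shortens the interval for pages that keep changing and lengthens it for pages that do not. The sketch below is only an assumption about what such a policy might look like; the halving and doubling rule, the interval bounds and the names `next_interval` and `RevisitSchedule` are all invented for illustration.

```python
import heapq

MIN_INTERVAL = 1.0     # hours; assumed lower bound on the revisit interval
MAX_INTERVAL = 168.0   # hours (one week); assumed upper bound


def next_interval(interval, changed):
    """Halve the interval if the page changed since the last visit,
    otherwise double it, keeping the result within the bounds."""
    new_interval = interval / 2 if changed else interval * 2
    return max(MIN_INTERVAL, min(MAX_INTERVAL, new_interval))


class RevisitSchedule:
    """Keeps pages ordered by the time at which they are next due."""

    def __init__(self):
        self._heap = []   # entries are (due_time, url, interval)

    def add(self, url, due_time, interval):
        heapq.heappush(self._heap, (due_time, url, interval))

    def pop_due(self, now):
        """Return (url, interval) for the page that has been due longest,
        or None if no page is due yet."""
        if self._heap and self._heap[0][0] <= now:
            _, url, interval = heapq.heappop(self._heap)
            return url, interval
        return None
```

After each visit, the spider would call `next_interval` with whether the page had changed and put the page back on the schedule with its new due time.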
Selectivity
The Web consists of billions of pages, so even a cluster of spiders cannot be expected to download the whole of the Web before pages are added, modified or deleted. Bot spiders must therefore prioritize the pages they download and copy, often in relation to a predefined topic, or list of topics, or by downloading only pages with static text -- written in Hypertext Markup Language (HTML) -- and ignoring all other types of content.
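Both kinds of selectivity can be sketched briefly. The example below assumes a very crude relevance score (the fraction of words that match a topic list) and a frontier that always hands back the highest-scoring URL first. The topic terms and the names `relevance`, `is_html` and `PriorityFrontier` are hypothetical; real spiders are likely to use far more sophisticated scoring.

```python
import heapq

TOPIC_TERMS = {"crawler", "spider", "index"}   # hypothetical topic list


def relevance(text):
    """Fraction of words in text that appear in TOPIC_TERMS."""
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(word.strip(".,;:!?") in TOPIC_TERMS for word in words)
    return hits / len(words)


def is_html(content_type):
    """True if a Content-Type header value describes an HTML page."""
    return content_type.split(";")[0].strip().lower() == "text/html"


class PriorityFrontier:
    """A URL frontier that returns the highest-scoring URL first."""

    def __init__(self):
        self._heap = []
        self._counter = 0   # tie-breaker: equal scores come out in insertion order

    def push(self, url, score):
        # heapq is a min-heap, so the score is negated to pop the largest first.
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

A spider built this way might score each link by the text of the page it was found on, push it onto the frontier with that score, and skip any response for which `is_html` is false.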
Other Applications
Bot spiders are not only used by search engines. They can be used by other applications to validate the structure of Web pages, including hyperlinks, or to generate statistics that allow Web content to be better understood. Bot spiders can also be used to gather specific information, including email addresses and contact information, a function that is frequently exploited by originators of Internet junk mail, or spam.
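Hyperlink validation is perhaps the simplest of these uses to illustrate. The sketch below assumes a `get_status` function, supplied by the caller, that returns the HTTP status code for a URL, or None if the server cannot be reached. The names `AnchorCollector` and `find_broken_links` are invented for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def find_broken_links(page_url, html, get_status):
    """Return (target_url, status) for every link on the page that fails.

    get_status(url) is an assumed callable returning an HTTP status code,
    or None when the target cannot be reached at all.
    """
    collector = AnchorCollector()
    collector.feed(html)

    broken = []
    for href in collector.hrefs:
        target = urljoin(page_url, href)   # resolve relative links
        status = get_status(target)
        if status is None or status >= 400:
            broken.append((target, status))
    return broken
```

Passing the status check in as a function keeps the sketch independent of any particular HTTP library and makes it easy to try out with a plain dictionary of known statuses.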