What Is an Internet Spider?
Internet or Web spiders, sometimes known as "Web crawlers" or "Web robots," are computer programs that explore the World Wide Web, gathering data about websites and pages. Search engines often use spiders to collect information about the content of websites and the links between them. A spider reaches a website by following links to it from other sites, then moves through the pages within that site in the same way, following the links defined by HTML anchors.
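As a rough illustration of how a spider might find the links on a page, the sketch below pulls the target of every HTML anchor out of a piece of markup using Python's standard library. The names AnchorCollector and extract_links and the example.* addresses are made up for this example; real crawlers handle many more cases, such as malformed markup and non-HTTP links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collects the href target of every <a> tag fed to the parser."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page itself.
                self.links.append(urljoin(self.base_url, value))


def extract_links(base_url, html):
    """Return the absolute URLs of all anchors found in `html`."""
    collector = AnchorCollector(base_url)
    collector.feed(html)
    return collector.links


# Hypothetical page content, used only for demonstration.
sample_html = (
    '<p><a href="/about.html">About</a> '
    '<a href="http://other.example/">Another site</a></p>'
)
print(extract_links("http://site.example/index.html", sample_html))
```

The first link stays within the same site and the second leads to a different one, which matches the two kinds of link a spider follows.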
Web Structure
The ability to create links between Web pages is a key aspect of the Web. Pages within a site can link to each other, as well as to pages on other sites, allowing users to reach information with simple mouse clicks. The result is the structure of the Web: a mass of content linked through HTML anchors. Web crawlers follow these links to discover which sites exist and what they contain, and search engines often use the data gathered while crawling to present their results.
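The link-following described here can be sketched as a breadth-first walk over a graph of pages. To keep the example runnable without network access, it assumes a small made-up "web" stored in a dictionary (toy_web) instead of real downloads; the crawl function and its max_pages limit are likewise illustrative choices, not a description of how any particular search engine works.

```python
from collections import deque


def crawl(link_graph, start_url, max_pages=100):
    """Visit pages breadth-first by following links, each page at most once."""
    seen = {start_url}
    queue = deque([start_url])
    visited_order = []
    while queue and len(visited_order) < max_pages:
        url = queue.popleft()
        visited_order.append(url)
        # A real spider would download `url` here and parse its anchors;
        # this sketch looks the outgoing links up in the dictionary instead.
        for target in link_graph.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return visited_order


# Hypothetical link structure: each key is a page, each value its outgoing links.
toy_web = {
    "http://a.example/": ["http://a.example/news", "http://b.example/"],
    "http://a.example/news": ["http://a.example/"],
    "http://b.example/": ["http://c.example/"],
}
print(crawl(toy_web, "http://a.example/"))
```

Tracking the pages already seen matters because pages frequently link back to each other, and without it the spider would loop forever.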
Search Engines
Search engines send visitor traffic to the websites listed in their results pages. When a user enters a search term and performs a search, the results presented often contain information obtained through crawling. Internet spider programs often arrive at a site by following a link to it from another site, and the data they gather includes some of the actual site content. The search engines feed this data into the algorithms they use to rank sites in order of importance in search listings. When analyzing the crawl data, one of their main aims is to determine which search keywords a site or page should be listed for.
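One simple way to picture the link between crawled content and search keywords is an index that maps each word to the pages containing it. The sketch below is only a toy under that assumption: build_index, search, and the sample pages are invented for illustration, and actual search engines use far more elaborate, unpublished ranking methods.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def search(index, term):
    """Return the URLs listed for a single keyword, in alphabetical order."""
    return sorted(index.get(term.lower(), set()))


# Hypothetical crawl data: URL mapped to the text found on that page.
crawled_pages = {
    "http://a.example/": "Fresh bread and cakes baked daily",
    "http://b.example/": "Bread recipes for home bakers",
}
keyword_index = build_index(crawled_pages)
print(search(keyword_index, "bread"))
print(search(keyword_index, "cakes"))
```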
Site Access
Website owners can achieve a level of control over the ways in which Web spiders access their content. Many websites store a text file named "robots.txt" in the root directory. When a crawler program begins exploring a site, it will normally check for this file first and read its rules. Website owners can structure their "robots.txt" file so that it tells the program not to explore some or all of the pages within the site if they do not want them crawled and indexed. The success of this technique varies, because following the file is voluntary: some spider programs do not check the text file at all.
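As an example of how a well-behaved spider might honor these rules, the sketch below feeds a small "robots.txt" file to Python's standard urllib.robotparser module and asks whether two pages may be fetched. The file contents, the "ExampleSpider" name, and the site.example addresses are assumptions made for the demonstration; a real crawler would download the file from the site's root directory first.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content: every spider is asked to stay out of /private/.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())


def allowed(url, user_agent="ExampleSpider"):
    """Return True if the parsed rules permit `user_agent` to fetch `url`."""
    return parser.can_fetch(user_agent, url)


print(allowed("http://site.example/index.html"))
print(allowed("http://site.example/private/data.html"))
```

Nothing in this check stops a program from requesting the blocked page anyway, which is why the file only works with spiders that choose to consult it.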
Website Marketing
People who specialize in Internet marketing often focus some of their efforts on optimizing the content and structure of a site to suit search engine spiders and ranking algorithms. This practice is known as SEO (Search Engine Optimization): tailoring the structure and content of a site so that it performs as well as possible in the search engine results pages. Doing this successfully is sometimes hampered by the fact that search engine organizations like to keep the details of their algorithms secret.