Google Spider Theory
To create and maintain its database of Web pages, Google uses automated programs called spiders, or Web crawlers, to traverse the Internet and record information about sites. These spiders download pages as they travel, collecting the information for indexing in the search engine’s database. Google classifies and ranks the pages its spiders discover by analyzing the links between them, and that system has been one of the reasons for the search engine’s popularity and success over the years.
-
Spiders
-
Web spiders begin their journey across the Internet with a set of seed URLs provided by their creator. The program visits the first page on the list, downloads it, and notes any hyperlinks on the page, adding them to the bottom of its list. Then it visits the next page and repeats the procedure. As the program travels, it builds up a list of linked URLs to visit, and if left to run indefinitely, it would eventually download every page on the Internet that is reachable via hyperlink from the seeds. Spiders also usually have an algorithm that sends them back to pages after a set period to check them for changes.
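The procedure described above can be sketched as a breadth-first traversal. The following is a minimal illustration rather than a description of any real spider: the `TOY_WEB` dictionary, the `crawl` function, and the `.example` URLs are all hypothetical, and a lookup in the dictionary stands in for downloading a page. The `seen` set, which keeps a URL from being queued twice, is an added assumption that the description above does not spell out.

```python
from collections import deque

# Hypothetical toy "web": each URL maps to the hyperlinks found on that page.
TOY_WEB = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example", "d.example"],
    "c.example": ["a.example"],
    "d.example": [],
    "e.example": ["a.example"],  # no page links here
}


def crawl(seeds, fetch_links, max_pages=1000):
    """Visit pages in first-in, first-out order, starting from the seed URLs."""
    frontier = deque(seeds)  # the list of URLs still to visit
    seen = set(seeds)        # assumption: never queue the same URL twice
    visited = []             # pages "downloaded", in visiting order
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()        # take the first URL on the list
        visited.append(url)
        for link in fetch_links(url):   # note the hyperlinks on the page
            if link not in seen:
                seen.add(link)
                frontier.append(link)   # add new URLs to the bottom of the list
    return visited


order = crawl(["a.example"], lambda url: TOY_WEB.get(url, []))
print(order)  # ['a.example', 'b.example', 'c.example', 'd.example']
```

In this toy graph `e.example` is never visited, because no page reachable from the seed links to it. A real crawler would also need to fetch pages over the network, respect site policies, and schedule revisits, none of which is shown here.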
Early Web Crawling
-
When search engines first began using spiders to catalog Web pages, the algorithms involved were simple. The earliest search engines ranked pages by how often a given keyword appeared on the page, assuming that more repetitions meant more information about the selected topic. Web authors quickly learned to abuse this system, however, through a practice known as keyword stuffing. Page creators would repeat keywords throughout the text, and would sometimes hide large banks of keywords in invisible text somewhere on the page to inflate their rankings.
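A small sketch can show why ranking by keyword frequency was so easy to game. The function names and the two sample pages below are hypothetical, and the scoring is deliberately naive; it is meant only to illustrate the idea, not to reproduce any particular early search engine.

```python
import re


def keyword_count(text, keyword):
    """Count how many times the keyword appears as a whole word."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return words.count(keyword.lower())


def rank_by_keyword(pages, keyword):
    """Order page names from most to fewest occurrences of the keyword."""
    return sorted(pages, key=lambda name: keyword_count(pages[name], keyword), reverse=True)


# Hypothetical pages: one informative, one stuffed with the keyword.
pages = {
    "honest": "A guide to growing tomatoes: soil, watering, and sunlight.",
    "stuffed": "tomatoes tomatoes tomatoes buy now tomatoes tomatoes",
}

ranking = rank_by_keyword(pages, "tomatoes")
print(ranking)  # ['stuffed', 'honest']
```

The stuffed page wins even though it says nothing useful about the topic, which is the weakness the paragraph above describes.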
-
Google
-
In 1996, Stanford students Larry Page and Sergey Brin decided that the search engine methodology of the time was too easy to manipulate and produced substandard results. They proposed a new system that would take into account the relationships between Web pages instead of just counting words on a page. Their spiders would count the number of hyperlinks pointing to a given page and use that figure as a representation of the page’s relative worth, assuming that high-quality pages would naturally gather many such “backlinks” in the online community. Initially, they called their search engine “BackRub,” but they eventually renamed it “Google” as it grew from a college project into a new business.
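Counting backlinks amounts to counting, for each page, how many other pages link to it. The sketch below is one simple way to do that on a hypothetical link graph; the `count_backlinks` name and the sample data are illustrative, and the choices to count each linking page once and to ignore self-links are assumptions rather than documented behavior.

```python
from collections import Counter


def count_backlinks(link_graph):
    """Return how many distinct other pages link to each page."""
    counts = Counter({page: 0 for page in link_graph})
    for source, targets in link_graph.items():
        for target in set(targets):   # assumption: repeated links count once
            if target != source:      # assumption: self-links are ignored
                counts[target] += 1
    return counts


# Hypothetical link graph: page -> pages it links to.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c", "c"],
}

backlinks = count_backlinks(links)
print(backlinks.most_common(1))  # [('c', 3)]
```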
PageRank
-
Google’s PageRank system counts hyperlinks to a page as “votes of support.” The more support a page has, the higher its ranking. As a page’s rank increases, so does the weight of its votes, meaning that a single vote from a highly ranked page may count for more than multiple votes from less prominent sites. This system reduces the payoff of inflating a page’s rank by creating a host of low-content pages all pointing to a single target, and it can allow Web pages to rise quickly through the rankings merely by attracting links from other highly ranked sites.
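The weighted-vote idea can be illustrated with a simplified iterative calculation. This is a sketch of the commonly published textbook formulation, not Google's production system. The damping factor of 0.85, the fixed iteration count, the handling of pages with no outgoing links, and all page names are assumptions made for the example. The function also assumes every linked page appears as a key in the graph.

```python
def pagerank(link_graph, damping=0.85, iterations=50):
    """Simplified PageRank: each page splits its rank among the pages it links to."""
    pages = list(link_graph)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal rank everywhere
    for _ in range(iterations):
        # Every page receives a small baseline share regardless of links.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, targets in link_graph.items():
            if targets:
                # A page's vote is worth its own rank divided among its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no links spreads its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical graph: "hub" is well supported and casts one vote for target_a;
# target_b gets two votes from pages that nobody links to.
graph = {
    "fan1": ["hub"], "fan2": ["hub"], "fan3": ["hub"],
    "hub": ["target_a"],
    "low1": ["target_b"], "low2": ["target_b"],
    "target_a": [],
    "target_b": [],
}

ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(f"{page}: {ranks[page]:.3f}")
```

In this toy graph `target_a` finishes above `target_b`, even though it has one backlink to `target_b`'s two, because its single vote comes from the well-supported `hub` page.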
-