How to Create a Web Spider
A web spider is a program that downloads a web page, then follows the links on that page and downloads those pages as well. Web spiders are used to save websites for offline reading and to store web pages in databases for use by a search engine. Creating a web spider is a challenging task, suitable for a college-level programming class. These instructions assume you have solid programming experience but no knowledge of spider architecture. The steps lay out one specific architecture for writing a web spider in the language of your choice.
Things You'll Need
- Web browser that responds to programmatic commands
- Programming language with read-write disk access and database functions
Instructions
1
Initialize your program with the initial web page you wish to download. Add the URL for this page to a new database table of URLs.
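As a minimal sketch of this step, assuming Python with its built-in sqlite3 module as the database: the table name `urls`, its columns, the helper `init_url_table`, and the seed address `http://example.com/` are all hypothetical choices made for illustration.

```python
import sqlite3

def init_url_table(conn, seed_url):
    """Create the URL table (if missing) and store the starting URL in it."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS urls ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " url TEXT NOT NULL)"
    )
    conn.execute("INSERT INTO urls (url) VALUES (?)", (seed_url,))
    conn.commit()

# An in-memory database keeps the sketch self-contained; a file path works too.
conn = sqlite3.connect(":memory:")
init_url_table(conn, "http://example.com/")
print(conn.execute("SELECT id, url FROM urls").fetchall())
# prints [(1, 'http://example.com/')]
```

The auto-incrementing `id` column preserves insertion order, so a "database pointer" can be as simple as the id of the next row to download.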
2
Send a command to the web browser instructing it to fetch this web page and save it to disk. Then move the database pointer forward one step, past the URL you just downloaded. On the first pass, the pointer now points to the end of the table.
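The download in this step might look like the sketch below. It swaps the scripted web browser for Python's standard urllib.request module, which is an assumption of this example rather than part of the steps. The helper name `save_page`, the `opener` parameter, and the hashed file-naming scheme are likewise hypothetical.

```python
import hashlib
import urllib.request
from pathlib import Path

def save_page(url, out_dir, opener=urllib.request.urlopen):
    """Download url and write the raw bytes to a file inside out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL to get a file name that is always safe for the file system.
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
    # opener defaults to a real network fetch; a stand-in can be passed for testing.
    with opener(url) as response:
        data = response.read()
    path = out_dir / name
    path.write_bytes(data)
    return path

# Example call (performs a real network request, so it is left commented out):
# saved = save_page("http://example.com/", "pages")
```

Passing the opener as a parameter is a design choice that lets the function be exercised without network access. A production spider would also want timeouts and error handling, which are omitted here.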
3
Read the web page into the program and parse it for links to additional web pages. This is typically done by searching for the text string "http://" (or "https://") and capturing the text between that string and a termination character such as a space, a quotation mark, or ">". A period is not a safe terminator, because periods occur inside host names and file names. Add these links to the end of the URL database table; the database pointer should remain at the first of the newly added URLs.
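The string search described in this step can be sketched with a regular expression. The pattern below stops at whitespace, quotation marks, and angle brackets; that exact terminator set, the name `extract_links`, and the sample markup are assumptions made for illustration.

```python
import re

# Match http:// or https:// and run until whitespace, a quote, or an angle bracket.
URL_PATTERN = re.compile(r"""https?://[^\s"'<>]+""")

def extract_links(html):
    """Return every absolute URL found in the page text, in document order."""
    return URL_PATTERN.findall(html)

sample = '<a href="http://example.com/a.html">A</a> <a href=\'https://example.org/b\'>B</a>'
print(extract_links(sample))
# prints ['http://example.com/a.html', 'https://example.org/b']
```

This approach only finds absolute URLs. Relative links such as `href="/about"` are missed, and handling them would need an HTML parser plus URL joining, which goes beyond this sketch.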
4
Test the entries in the database table for uniqueness. Where a URL appears more than once, keep the earliest entry and remove the later copies, so that pages already downloaded are not queued again.
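One possible way to run the uniqueness test, assuming an SQLite table named `urls` with an integer `id` column (both hypothetical), is a single DELETE statement that keeps the lowest id for each URL.

```python
import sqlite3

def remove_duplicate_urls(conn):
    """Delete repeated URLs, keeping the row with the lowest id for each one."""
    conn.execute(
        "DELETE FROM urls WHERE id NOT IN ("
        " SELECT MIN(id) FROM urls GROUP BY url)"
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE urls (id INTEGER PRIMARY KEY, url TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO urls (url) VALUES (?)",
    [("http://example.com/",), ("http://example.com/a",), ("http://example.com/",)],
)
remove_duplicate_urls(conn)
print(conn.execute("SELECT id, url FROM urls ORDER BY id").fetchall())
# prints [(1, 'http://example.com/'), (2, 'http://example.com/a')]
```

Keeping the earliest row means entries the pointer has already passed stay in place, so only the newly added copies disappear.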
5
If you wish to apply a URL filter (for example, to prevent downloading pages from sites at different domains), apply it now to the URL database table and remove any URLs you do not wish to download.
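A same-domain filter of the kind mentioned in this step could be sketched as follows. The function name `same_domain` and the decision to accept subdomains are assumptions of this example; a stricter filter might require an exact host match.

```python
from urllib.parse import urlsplit

def same_domain(url, allowed_host):
    """Return True when url points at allowed_host or one of its subdomains."""
    host = (urlsplit(url).hostname or "").lower()
    allowed_host = allowed_host.lower()
    return host == allowed_host or host.endswith("." + allowed_host)

urls = [
    "http://example.com/about",
    "http://blog.example.com/post",
    "http://other.org/page",
]
kept = [u for u in urls if same_domain(u, "example.com")]
print(kept)
# prints ['http://example.com/about', 'http://blog.example.com/post']
```

Comparing parsed host names, rather than searching for the domain anywhere in the URL text, avoids accepting addresses like `http://other.org/example.com`.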
6
Set up a programmatic loop so your spider returns to step 2 above and repeats until the database pointer reaches the end of the table. This will download every URL your spider encounters, one after another. Removing duplicate URLs ensures that the spider terminates properly once it has downloaded the last unique URL.
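Putting the steps together, the loop might be sketched as below. To stay runnable without a network or database, this version makes several assumptions: a Python list stands in for the URL table, a caller-supplied `fetch` function stands in for the browser, duplicates are rejected as they are found rather than removed afterward, and a `max_pages` cap is added as a safety limit. All names are hypothetical.

```python
import re

URL_PATTERN = re.compile(r"""https?://[^\s"'<>]+""")

def crawl(seed_url, fetch, keep=lambda url: True, max_pages=100):
    """Visit pages in table order and return the URLs in download order."""
    urls = [seed_url]      # stands in for the URL database table
    seen = {seed_url}      # lets duplicates be rejected as they are found
    pointer = 0            # index of the next URL to download
    while pointer < len(urls) and pointer < max_pages:
        html = fetch(urls[pointer])              # step 2: download the page
        pointer += 1                             # step 2: advance the pointer
        for link in URL_PATTERN.findall(html):   # step 3: parse for links
            if link not in seen and keep(link):  # steps 4 and 5
                seen.add(link)
                urls.append(link)
    return urls[:pointer]

# A hypothetical three-page site held in memory, so no network is needed.
site = {
    "http://example.com/": '<a href="http://example.com/a">a</a>',
    "http://example.com/a": '<a href="http://example.com/">home</a> '
                            '<a href="http://example.com/b">b</a>',
    "http://example.com/b": "no links here",
}
print(crawl("http://example.com/", lambda url: site.get(url, "")))
# prints ['http://example.com/', 'http://example.com/a', 'http://example.com/b']
```

The loop ends when the pointer catches up with the end of the list, which happens once no page contributes a URL that has not been seen before.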
Tips & Warnings
If you are using a Unix operating system, check the Unix documentation (or "man pages") for curl and wget. Both can fetch pages from the command line, and wget includes built-in spidering options (such as recursive download), which can greatly reduce programming time and complexity.