How to Spider Documents
If you want to make your documents accessible online, you will want search engines to spider, or index, them. Web robots, also called spiders, are the programs that search engines such as Google use to traverse the Web, indexing content and collecting information. You can control which of your documents they visit by creating a "robots.txt" file. This file is located in the main Website directory and advises spiders and robots on which files and directories they should not access. A "robots.txt" file thus helps to reduce wasted server resources and to remove clutter from Web statistics, especially for URLs that have been moved or removed.
Instructions
1
Make a list of all the documents that you want to index and those that you do not want indexed.
2
Open Notepad and copy the lines below:
User-agent: *
Disallow: /images/

User-agent: Googlebot-Image
Disallow: /images/
The first section asks all spiders to stay out of the "images" folder, which effectively keeps its contents from being indexed. Well-behaved spiders follow this request, but the file cannot force them to.
The second section gives the same instruction specifically to the "Googlebot-Image" spider.
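To see how a spider might read these two sections, the rules can be checked with Python's standard urllib.robotparser module. This is only a minimal sketch: the spider name "ExampleBot" and the example.com addresses are made up for illustration, and real search engine spiders may interpret some details differently.

```python
from urllib.robotparser import RobotFileParser

# Rules equivalent to the two sections described in this step.
RULES = """\
User-agent: *
Disallow: /images/

User-agent: Googlebot-Image
Disallow: /images/
"""


def is_allowed(rules: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


# "ExampleBot" and the example.com URLs are hypothetical.
print(is_allowed(RULES, "ExampleBot", "http://www.example.com/images/logo.png"))
print(is_allowed(RULES, "Googlebot-Image", "http://www.example.com/images/logo.png"))
print(is_allowed(RULES, "ExampleBot", "http://www.example.com/index.html"))
```

The first two checks report that the image is off limits, while the last shows that pages outside the "images" folder remain open to spiders.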
3
Add as many "Disallow" statements as you need, one for each folder that you want skipped during indexing. Refer to the list you created in Step 1 to ensure no folder is missed.
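If the list of folders is long, the "Disallow" lines can be generated rather than typed by hand. The sketch below assumes a hypothetical helper named build_robots_txt and made-up folder names; it simply writes one "Disallow" line per folder under a single "User-agent" line.

```python
def build_robots_txt(disallowed_folders, user_agent="*"):
    """Build robots.txt text with one Disallow line per folder."""
    lines = [f"User-agent: {user_agent}"]
    for folder in disallowed_folders:
        name = folder.strip("/")
        if not name:
            # Skip empty entries so the whole site is not blocked by accident.
            continue
        # Normalize to a leading and trailing slash, e.g. "images" -> "/images/".
        lines.append(f"Disallow: /{name}/")
    return "\n".join(lines) + "\n"


# The folder names here are hypothetical; use the list made in Step 1.
print(build_robots_txt(["images", "/private/", "tmp/cache"]))
```

The printed text can be pasted into Notepad as is.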
4
List any individual files that you want skipped during indexing, as shown below:
User-agent: *
Disallow: /documents/ehow.txt
The statements above tell all spiders to avoid indexing the "ehow.txt" file located inside the "documents" folder. Paths are case-sensitive, so match the capitalization used on your server.
Repeat the "Disallow" line for any other documents that need to be skipped during indexing.
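A file-level rule can be checked in the same spirit with Python's urllib.robotparser. In this sketch the spider name "ExampleBot" and the second file name are hypothetical. It illustrates that only the listed file is skipped, and that this parser treats a path with different capitalization as a different path.

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /documents/ehow.txt
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def blocked(path: str) -> bool:
    """Return True if the hypothetical spider "ExampleBot" must skip `path`."""
    return not parser.can_fetch("ExampleBot", path)


print(blocked("/documents/ehow.txt"))   # the listed file
print(blocked("/documents/other.txt"))  # another file in the same folder
print(blocked("/Documents/ehow.txt"))   # same name, different capitalization
```

Only the first path is reported as blocked; the other two stay available to spiders.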
5
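Save the file as "robots.txt" in plain-text format and upload it to the main (root) directory of the website. Spiders look for the file only at the top level of the site, not in subfolders.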
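Because spiders look for the file at the top level of the site, its address can be worked out from any page address on the same site. The sketch below uses a hypothetical helper, robots_txt_url, and an example.com address that stands in for your own site.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the address where spiders expect robots.txt for this site."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, and replace the path with /robots.txt.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# example.com is a placeholder for your own domain.
print(robots_txt_url("https://www.example.com/documents/ehow.txt"))
```

Opening the resulting address in a browser after uploading is a quick way to confirm that the file landed in the right directory.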
Tips & Warnings
A Sitemaps protocol file (a sitemap) can be used to give search engines a list of all the pages that exist on your website. Alternatively, if you want all folders on your website indexed, create a "robots.txt" file with the following information:
User-agent: *
Disallow:
To block a particular spider from indexing your documents, check the search engine's website to find the robot's name and details on how to prevent it from accessing your files and directories.
If you have a secret directory that you want skipped during indexing, do not list it in the "robots.txt" file. The file is publicly readable, so spammers and hackers can easily read its contents and learn where the directory is.
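The allow-all file from the first tip differs by a single character from a file that blocks the entire site, so it is worth checking before uploading. This sketch uses Python's urllib.robotparser, with a hypothetical spider name "ExampleBot", to compare an empty "Disallow:" line with "Disallow: /".

```python
from urllib.robotparser import RobotFileParser


def can_fetch(rules: str, path: str, agent: str = "ExampleBot") -> bool:
    """Return True if `agent` may fetch `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)


ALLOW_ALL = "User-agent: *\nDisallow:\n"    # empty value: nothing is off limits
BLOCK_ALL = "User-agent: *\nDisallow: /\n"  # a lone slash: the whole site is off limits

print(can_fetch(ALLOW_ALL, "/documents/ehow.txt"))
print(can_fetch(BLOCK_ALL, "/documents/ehow.txt"))
```

The first check allows the document and the second refuses it, which is why a stray slash after "Disallow:" can remove a whole site from search results.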