How to Block Duplicate Content Scraper Sites

Edit the .htaccess file in order to block web scraping programs.

Scraper sites are websites that republish another website's content without permission. Scraping a website's content infringes on the owner's intellectual property. It also creates problems for search engine optimization (SEO), the practice of getting a website or its pages to appear prominently in search engine results pages (SERPs). Duplicate content on the Internet can negatively affect the original website's search engine ranking.

Instructions

    • 1

      Open the FTP (File Transfer Protocol) program that you will use to access your website's files. You will need an FTP account in order to log in. If you don't have one, go to the cPanel of your website hosting account. The cPanel is the main control panel where adjustments to the hosting account are made.

      Once you have logged in to your cPanel, look for the "Files" section. Open "FTP Accounts." Under "Add FTP Account," enter the login name and password you want for the account.

      Go back to your FTP program. Under domain, enter your website's domain, and log in with your FTP account's login name and password.

    • 2

      Go to the root directory (the main directory) of your website. Find the file named ".htaccess." Right-click on it, and click on "Open/Edit." The .htaccess file will open in a text editor.

    • 3

      Highlight the code below, and copy it.

      # Blocking Bots and Spiders

      RewriteEngine On

      # Let requests for the sitemap through untouched

      RewriteCond %{REQUEST_URI} =/sitemaps.xml

      RewriteRule ^ - [L]

      RewriteCond %{REMOTE_ADDR} ^77\.91\.224\. [OR]

      RewriteCond %{HTTP_USER_AGENT} ia_archiver [NC,OR]

      RewriteCond %{HTTP_USER_AGENT} discobot [NC,OR]

      RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]

      RewriteCond %{HTTP_USER_AGENT} ^Bot\ [OR]

      RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]

      RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]

      RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]

      RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]

      RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]

      RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]

      RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]

      RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]

      RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]

      RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]

      RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]

      RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]

      RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]

      RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]

      RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]

      RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]

      RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]

      RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]

      RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]

      RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]

      RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]

      RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]

      RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]

      RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]

      RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]

      RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]

      RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]

      RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]

      RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]

      RewriteCond %{HTTP_USER_AGENT} LinksManager\.com_bot [NC,OR]

      RewriteCond %{HTTP_USER_AGENT} linkwalker [NC,OR]

      RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]

      RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]

      RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]

      RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]

      RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]

      RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]

      RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]

      RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]

      RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]

      RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]

      RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]

      RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]

      RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]

      RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]

      RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR]

      RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]

      RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]

      RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]

      RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]

      RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]

      RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]

      RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]

      RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]

      RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]

      RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]

      RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]

      RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]

      RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]

      RewriteCond %{HTTP_USER_AGENT} webalta [NC,OR]

      RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]

      RewriteCond %{HTTP_USER_AGENT} WebCollage [NC,OR]

      RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]

      RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]

      RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]

      RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]

      RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]

      RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]

      RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]

      RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]

      RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]

      RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]

      RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]

      RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]

      RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]

      RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]

      RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]

      RewriteCond %{HTTP_USER_AGENT} Yandex [NC,OR]

      RewriteCond %{HTTP_USER_AGENT} zermelo [NC,OR]

      RewriteCond %{HTTP_USER_AGENT} ^Zeus [NC,OR]

      RewriteCond %{HTTP_USER_AGENT} ZyBorg [NC]

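      # Send matching requests to bot-response.php, a page you supply (if it is missing, the request gets a 404)

      RewriteRule .* bot-response.php [L]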

    • 4

      Go back to the .htaccess file and paste the code above the line "# END".

      Save the file on your computer. Make sure to save it as .htaccess. Do not add extensions to this file, e.g., .htaccess.txt or .htaccess.html. It must be saved only as .htaccess.

    • 5

      Go back to the FTP program. The "Local" section displays the files on your computer. Find the folder and the .htaccess file you saved. The "Remote" section displays your website's directory. Drag and drop the .htaccess file from your computer onto your website's main directory. Your website now turns away the bots and download tools named in the list.
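Before uploading, it can help to preview which User-Agent strings the rules would catch. The sketch below is only an approximation: it uses Python's re module in place of Apache's regex engine, copies a handful of patterns from the list in step 3, and treats the [NC] flag as case-insensitive matching. The names BLOCK_PATTERNS and is_blocked are made up for this illustration and are not part of Apache.

```python
import re

# (pattern, case_insensitive) pairs copied from a few RewriteCond lines in step 3.
# Assumption: Apache's [NC] flag is modeled here with re.IGNORECASE.
BLOCK_PATTERNS = [
    (r"ia_archiver", True),
    (r"HTTrack", True),
    (r"^Wget", False),
    (r"^WebZIP", False),
    (r"^Teleport\ Pro", False),
]

COMPILED = [
    re.compile(pattern, re.IGNORECASE if nocase else 0)
    for pattern, nocase in BLOCK_PATTERNS
]


def is_blocked(user_agent: str) -> bool:
    """Return True if any pattern matches, like the [OR]-chained conditions."""
    # search() finds a match anywhere in the string, so only patterns that
    # start with ^ are tied to the beginning of the User-Agent.
    return any(regex.search(user_agent) for regex in COMPILED)


if __name__ == "__main__":
    samples = [
        "Wget/1.21.3",
        "wget/1.21.3",
        "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)",
        "Mozilla/5.0 (X11; Linux x86_64) Firefox/120.0",
    ]
    for agent in samples:
        print(f"{agent!r}: {'blocked' if is_blocked(agent) else 'allowed'}")
```

Note that patterns beginning with ^ and lacking [NC] only match when the User-Agent starts with that exact spelling. A scraper that sends a browser-like User-Agent will not match any of these patterns, so the list stops only tools that identify themselves.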
