How to Scrape & Parse Addresses
Scraping and parsing are two closely related data-mining practices. Parsing, the more general of the two, means breaking data down into its constituent parts. When your middle-school English teacher asked you to diagram sentences, you were parsing those sentences into their parts of speech. Scraping refers more specifically to parsing web pages for particular types of data, in this case web addresses. The Python programming language and the "BeautifulSoup" library let you scrape and parse a web page in a few lines of code.
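To make the idea of parsing concrete, here is a minimal sketch that breaks a single web address into its parts. It assumes Python 3 and uses only the standard urllib.parse module; the address is a made-up example, not one taken from this article.

```python
from urllib.parse import urlsplit

# Hypothetical address used only for illustration.
address = "http://www.example.com/articles/index.html?page=2"

# urlsplit breaks the address into named components.
parts = urlsplit(address)

print(parts.scheme)   # http
print(parts.netloc)   # www.example.com
print(parts.path)     # /articles/index.html
print(parts.query)    # page=2
```

Scraping applies the same kind of breakdown to a whole web page instead of a single string.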
Instructions
1
Install BeautifulSoup by downloading it from the Crummy software site (crummy.com) and unpacking the archive (untar or unzip the file). Open a Terminal window and type the following command:
My-iMac:~ me$ cd Downloads/BeautifulSoup-3.2.0 && python setup.py install
This changes into the BeautifulSoup-3.2.0 folder, which is in the Downloads folder, and tells the Python interpreter to run the install script, setup.py, found there.
2
Type python at the prompt, press Return, and import the BeautifulSoup class from the BeautifulSoup module:
My-iMac:~ me$ python
>>> from BeautifulSoup import BeautifulSoup
3
Run the following script to open a web page and print any Uniform Resource Locators (URLs, or web addresses) found in the page:
>>> import urllib2
>>> page = urllib2.urlopen("http://www.THE URL YOU WANT TO SCRAPE HERE")
>>> soup = BeautifulSoup(page)
>>> for link in soup.findAll('a'):
...     print link.get('href')
...
This script opens a web page, parses the HTML, finds every <a> tag (the tag in which web addresses are embedded), and prints the address stored in each tag's href attribute, leaving the surrounding markup behind.
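For readers who do not have BeautifulSoup installed, the same idea can be sketched with a different tool. This is not the BeautifulSoup method used in the steps above: it assumes Python 3 and swaps in the standard library's html.parser module. The LinkCollector class, the sample page, and the example.com addresses are all hypothetical names made up for illustration, and the page is inlined so the sketch runs without a network connection.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only anchor tags carry the web addresses we want.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page address.
                self.links.append(urljoin(self.base_url, value))


# Hypothetical page content standing in for a downloaded page.
sample_html = """
<html><body>
  <a href="http://www.example.com/first">First</a>
  <a href="/second">Second</a>
  <a name="no-link">Not a link</a>
</body></html>
"""

collector = LinkCollector("http://www.example.com/")
collector.feed(sample_html)

for link in collector.links:
    print(link)
```

The third <a> tag has no href attribute, so it is skipped; the relative address /second is joined to the base address before it is stored.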