How to Write an Email Miner for Python
Mining email means extracting information from the text of messages, such as the number of words or sentences, or the richness of your correspondent's vocabulary. Writing an email miner in Python involves several moving parts: standard-library modules that download messages from a mail server, and add-on packages that analyze the text. Messages are converted to strings so that those packages can parse them and display what they find. The task requires more than a passing knowledge of Python programming, so proceed with caution and patience.
Instructions
1
Open a terminal session and type python -V at the prompt to check your Python version (use a capital V; a lowercase -v starts the interpreter in verbose mode instead). The script in step 2 uses Python 3 syntax, such as print() as a function and line.decode(), and current releases of NLTK and PyYAML support Python 3, so use a Python 3 interpreter. Visit the Python Package Index (PyPI); find and download the PyYAML and NLTK source packages. Unzip or untar them. Change to the PyYAML directory. At the command-line prompt, type: sudo python setup.py install. It should look like this:
My-Computer:PyYAML-3.2.0 Me$ sudo python setup.py install
You will be prompted for your password. Type it and press Return. Follow the same procedure for every Python package you install.
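Before moving on, it can help to confirm from inside Python that the interpreter and the packages are in place. The following is a minimal sketch, not part of the original procedure: the helper name check_environment is hypothetical, and it assumes PyYAML is imported under the module name yaml and NLTK under the name nltk.

```python
import importlib.util
import sys


def check_environment(modules=("yaml", "nltk")):
    """Return the interpreter version and whether each module can be found.

    check_environment is a hypothetical helper name; "yaml" is the import
    name used by PyYAML and "nltk" the one used by NLTK.
    """
    found = {name: importlib.util.find_spec(name) is not None for name in modules}
    return {"python": sys.version_info[:3], "modules": found}


if __name__ == "__main__":
    report = check_environment()
    print("Python version:", ".".join(str(n) for n in report["python"]))
    for name, ok in report["modules"].items():
        print(name, "is installed" if ok else "is MISSING")
```

If a package is reported as missing, repeat the installation procedure above for it.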
2
Download mail messages for parsing with the following lines of code:
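#!/usr/local/bin/python
import poplib, getpass
import mailconfig    # a local module that must define popservername and popusername

mailserver = mailconfig.popservername
mailuser = mailconfig.popusername
mailpasswd = getpass.getpass('Password for %s?' % mailserver)

# Connect and log in to the POP3 server
server = poplib.POP3(mailserver)
server.user(mailuser)
server.pass_(mailpasswd)

# Report the mailbox size and list the messages
print(server.getwelcome())
msgCount, msgBytes = server.stat()
print('There are', msgCount, 'mail messages in', msgBytes, 'bytes')
print(server.list())
print('-' * 80)
input('[Press Enter key]')

# Fetch and print each message; POP3 message numbers start at 1
for i in range(msgCount):
    hdr, message, octets = server.retr(i+1)
    for line in message:
        print(line.decode())
    print('-' * 80)
    if i < msgCount - 1:
        input('[Press Enter key]')    # pause between messages
# The session stays open; call server.quit() when you are finished with it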
This script connects to your POP3 mail server, prompts you for your password, reports the number of messages on the server, and then retrieves and prints each one. The server name and user name come from the mailconfig module.
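The script prints each message but does not keep it. As a sketch of how the messages could be held in memory for later mining, the function below collects each one as a single string. The name fetch_messages is an illustration rather than part of the original script, and it assumes only that its argument behaves like a logged-in poplib.POP3 object, that is, it provides stat() and retr(). Decoding with errors="replace" is also an assumption, made so that a message in an unexpected encoding does not stop the loop. FakeServer is a made-up stand-in so the example runs without a mail account.

```python
def fetch_messages(server):
    """Return every message on the server as one decoded string each.

    `server` is assumed to behave like a logged-in poplib.POP3 object.
    """
    msg_count, _total_bytes = server.stat()
    messages = []
    for number in range(1, msg_count + 1):      # POP3 numbering starts at 1
        _response, lines, _octets = server.retr(number)
        text = "\n".join(line.decode("utf-8", errors="replace") for line in lines)
        messages.append(text)
    return messages


class FakeServer:
    """Stand-in for poplib.POP3 so the sketch can run without a mail account."""

    def __init__(self, mailbox):
        self.mailbox = mailbox      # list of messages, each a list of byte lines

    def stat(self):
        size = sum(len(line) for msg in self.mailbox for line in msg)
        return len(self.mailbox), size

    def retr(self, number):
        lines = self.mailbox[number - 1]
        return b"+OK", lines, sum(len(line) for line in lines)


demo = FakeServer([
    [b"Subject: hello", b"", b"First message body."],
    [b"Subject: loan", b"", b"About the mortgage."],
])
messages = fetch_messages(demo)
print(len(messages), "messages fetched")
```

With a real account, the same function could be called on the logged-in server object from the script: messages = fetch_messages(server).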
3
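Prepare a message for mining by converting it to a string, a native data type in Python that can be searched with Python's string methods, its regular expression engine, and the Natural Language Toolkit. Add these lines to the end of the script from step 2, or enter them in the same interactive session, while the connection to the server is still open:
import re
import nltk
from email.parser import Parser    # parses a message string into headers and body
m = server.retr(1)[1]    # the first message, as a list of byte strings (one per line)
server.quit()    # close the POP3 session now that the message is in memory
s = '\n'.join(line.decode() for line in m)    # join the lines into one string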
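The Parser class from email.parser can separate a message's headers from its body, so that header lines are not counted as message text. The sketch below assumes a simple, single-part plain-text message; the sample text and the variable names raw, msg, and body are made up for illustration.

```python
from email.parser import Parser

# A made-up message standing in for the string built from a downloaded email.
raw = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Loan paperwork\n"
    "\n"
    "The mortgage papers are ready.\n"
    "Please sign them by Friday.\n"
)

msg = Parser().parsestr(raw)        # returns an email.message.Message object

sender = msg["From"]                # header lookup is case-insensitive
subject = msg["Subject"]

# get_payload() returns the body as a string only for non-multipart messages;
# multipart messages return a list of parts and need extra handling.
body = msg.get_payload() if not msg.is_multipart() else ""

print(sender)
print(subject)
print(body)
```

Mining body rather than the whole message string keeps addresses and routing headers out of the word counts.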
4
Mine the first message for any information of interest. Discover how many words are in that message by entering the following commands:
>>> words = re.findall(r'\w+', s)
>>> len(words)
The second command returns an integer: the number of words. Note that len(s) on its own counts characters, not words. To find occurrences of the word mortgage, wrap the word list in an NLTK Text object and call its concordance method:
>>> nltk.Text(words).concordance('mortgage')
This prints occurrences of the word mortgage (up to 25 by default), each with the words around it, which is very useful for detectives investigating mortgage fraud.
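The measures mentioned in the introduction (word count, sentence count, and vocabulary richness) can also be estimated with the standard library alone. The following sketch uses deliberately simple rules that are assumptions of this example: a word is a run of letters or apostrophes, a sentence ends at a period, exclamation mark, or question mark, and richness is the ratio of distinct words to total words. The function names text_stats and sentences_with and the sample text are hypothetical.

```python
import re


def text_stats(text):
    """Return word count, sentence count, and distinct-to-total word ratio."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    richness = len(set(words)) / len(words) if words else 0.0
    return {"words": len(words), "sentences": len(sentences), "richness": richness}


def sentences_with(text, word):
    """Return the sentences that contain `word` as a whole word, ignoring case."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if pattern.search(s)]


sample = "The mortgage is due. Pay the mortgage today! Is the rate fixed?"
stats = text_stats(sample)
hits = sentences_with(sample, "mortgage")
print(stats)
print(hits)
```

Because the sentence splitter is a simple regular expression, abbreviations such as "Mr." will be treated as sentence ends; NLTK's tokenizers handle such cases better at the cost of an extra data download.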
References
- Programming Python; Mark Lutz, 2010
- Python pop3 library
- Python email Package