How to Use a Regex to Parse XML
Parsing XML is a challenge for text analysis because XML is extensible: each document type can define its own tags. XML is also hierarchical, meaning that some tags contain other tags. A regular expression (regex) can identify patterns in XML text. A regex for XML tags will match everything between the angle brackets < and >, but it won't show how those tags are nested. Using the Python programming language and the Natural Language Toolkit (NLTK) package, which works alongside Python's regular expressions and text manipulation tools, you can separate the tag structure from the text and display the XML tags and their organization.
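To illustrate the limitation described above, here is a minimal sketch using Python's built-in re module. The sample string and the names TAG_PATTERN, tags and text_only are made up for this illustration and are not taken from the corpus file. The pattern finds every tag and can strip the tags from the text, but the result is a flat list with no nesting information.

```python
import re

# Hypothetical sample in the style of XML play markup (not the real corpus file).
sample = "<PLAY><TITLE>The Merchant of Venice</TITLE><ACT><TITLE>ACT I</TITLE></ACT></PLAY>"

# Match "<", then one or more characters that are not ">", then ">".
TAG_PATTERN = re.compile(r"<[^>]+>")

# Every tag in document order, as a flat list with no hierarchy.
tags = TAG_PATTERN.findall(sample)

# The text with each tag replaced by a space, then whitespace normalized.
text_only = " ".join(TAG_PATTERN.sub(" ", sample).split())

print(tags)
print(text_only)
```

Nothing in the flat list says that the second TITLE belongs to ACT rather than to PLAY, which is why the steps below turn to an XML parser for the structure.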
Instructions
1
Open a terminal window and type the command "python -V" (with a capital V) at the prompt to check that Python is present on your computer and to see its version. Go to the NLTK homepage and download and install the NLTK package appropriate for your operating system. Check that NLTK is properly installed by entering the command "import nltk" at the Python prompt (>>>); if no error message appears, the installation worked.
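The same check can also be scripted. The following is a small sketch, not part of NLTK itself; the helper name is_installed is made up for this example. It reports the Python version and whether a module can be found, without actually importing it.

```python
import importlib.util
import sys


def is_installed(module_name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None


# The first whitespace-separated field of sys.version is the version number.
print("Python version:", sys.version.split()[0])
print("nltk installed:", is_installed("nltk"))
```

If the second line prints False, install NLTK before continuing with the next step.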
2
Type "nltk.download()" at the Python prompt to open the NLTK downloader window. Choose the row labeled "all" and click the download button. This downloads a number of texts for NLTK to work with, among them Shakespeare's "The Merchant of Venice" formatted with special XML tags for plays.
3
Locate the XML-tagged "Merchant of Venice" with the following command at the Python prompt:
>>> merchant_file = nltk.data.find('corpora/shakespeare/merchant.xml')
Read the contents of the file into a variable so that you can manipulate the text with Python commands:
>>> raw = open(merchant_file).read()
To make sure the text is there, enter the following command to view the first 168 characters (in Python 2, type "print raw[0:168]" instead):
>>> print(raw[0:168])
You will see the XML header tags and the special XML play tags.
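Once the raw text is in a string, a regex can take an inventory of the tags it contains. The sketch below assumes a short made-up string, raw_sample, standing in for the raw variable so that it runs without the corpus; the names OPENING_TAG and tag_counts are likewise invented for this example.

```python
import re
from collections import Counter

# Hypothetical stand-in for the `raw` string read in this step.
raw_sample = """<?xml version="1.0"?>
<PLAY>
<TITLE>The Merchant of Venice</TITLE>
<ACT><TITLE>ACT I</TITLE></ACT>
<ACT><TITLE>ACT II</TITLE></ACT>
</PLAY>"""

# Capture the name of each opening tag. The name must start with a letter
# directly after "<", so "<?xml ...?>" and closing tags like "</ACT>" are skipped.
OPENING_TAG = re.compile(r"<([A-Za-z][\w.-]*)")

# Count how often each tag name opens an element.
tag_counts = Counter(OPENING_TAG.findall(raw_sample))

print(tag_counts.most_common())
```

With the real file, replacing raw_sample with raw should give the counts for the whole play, though the regex still says nothing about which tags sit inside which.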
4
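Enter the following command at the Python prompt:
>>> from xml.etree.ElementTree import ElementTree
and press "Return." (Older NLTK releases provided this module as nltk.etree.ElementTree; with current releases, use the copy in Python's standard library.) Then type the following at the Python prompt:
>>> merchant = ElementTree().parse(merchant_file)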
The parse command reads the file and returns the root element of the play, which gives access to the XML tags and their content. To view the elements nested directly under the root, enter the following command at the Python prompt:
>>> list(merchant)
This shows the special XML play tags at the top level of the hierarchy, in document order. (Older versions of Python also accepted merchant.getchildren(), which was removed in Python 3.9.) The output of this command should look like this, although the memory addresses will differ:
[<Element 'TITLE' at 0x2261b48>, <Element 'PERSONAE' at 0x2261b20>, <Element 'SCNDESCR' at 0x22cc260>, <Element 'PLAYSUBT' at 0x22cc198>, <Element 'ACT' at 0x22cc0f8>, <Element 'ACT' at 0xf2bff08>, <Element 'ACT' at 0xf3218a0>, <Element 'ACT' at 0xf0e8a30>, <Element 'ACT' at 0xee07328>]
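The list above contains only the elements directly under the root. To see the full nesting, one option is to walk the tree recursively. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the miniature play in sample and the helper name outline are made up for this illustration so that the code runs without the corpus.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature play in the style of the corpus markup.
sample = (
    "<PLAY><TITLE>The Merchant of Venice</TITLE>"
    "<ACT><TITLE>ACT I</TITLE>"
    "<SCENE><TITLE>SCENE I</TITLE></SCENE>"
    "</ACT></PLAY>"
)


def outline(element, depth=0):
    """Return one indented line per element, each parent before its children."""
    lines = ["  " * depth + element.tag]
    # Iterating over an element yields its direct children in document order.
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(sample)
print("\n".join(outline(root)))
```

To try this on the play itself, call outline(merchant) on the root element returned by the parse command. Expect a long listing, since every element in the play produces one line.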
References
- "Natural Language Processing with Python"; Steven Bird, et al.; 2009