How to Extract Text Between Tags in HTML Pages
HTML documents consist of tags, which describe the structure of a page, and the text content between those tags. When you extract text from a web page, you usually do not want the tags included. Because web browsers render the page instead of showing the tags, it is easy to copy text from a page while it is displayed in a browser. Besides copying, you can also convert the web page into a plain text file, which strips out the HTML elements.
Instructions
1
Open the HTML file in a web browser that can save pages as plain text; examples include Firefox and Internet Explorer. Click "File" and then click "Save As." Choose the plain text option from the file type drop-down list. Click "Save." The browser converts the web page into a text document that you can open in any text editor.
2
Load the HTML document in a web browser. Hold down the left mouse button and drag over the text you want to extract. Click the "Edit" menu or right-click the selection, and then click "Copy." Open a new file in a text editor or word processor. Click "Edit" and then click "Paste," or press "Ctrl" and "V." Save the text as a plain text file.
3
Go to an online HTML-to-text converter, such as the one at WebToolHub. Select and copy the HTML that contains the text you want to extract, and then paste it into the conversion box. Click "Convert." The conversion site removes all HTML tags, leaving only the text between the tags. Note that such converters preserve little to none of the original formatting.
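If you would rather do the same conversion on your own computer, the tag stripping described in the steps above can be sketched in a few lines of Python using only the standard library's html.parser module. This is a minimal illustration and not how any particular browser or conversion site works. The names TextExtractor and html_to_text and the sample markup are made up for this example, and skipping the contents of script and style elements is an assumption about what most readers would want.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text between tags (hypothetical helper for this example)."""

    # Assumption: script and style contents are code, not readable text.
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self.parts.append(text)


def html_to_text(html):
    """Return the text content of an HTML string with all tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Joining with single spaces discards the page's original layout.
    return " ".join(parser.parts)


sample = (
    "<html><head><title>Demo</title><style>p{color:red}</style></head>"
    "<body><h1>Hello</h1><p>Some <b>bold</b> text.</p></body></html>"
)
print(html_to_text(sample))
```

As with the online converters, the result is plain text with no formatting: headings, paragraphs and bold text all end up on a single line.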