How to Use a RegExp to Remove Unwanted HTML

By NeilO


A common and helpful use of regular expressions is filtering out bad data on the server side of a web application. In many cases you want to filter rather than fail hard, because users are usually copying and pasting, not trying to hack you. Here are some steps you can follow to remove non-approved HTML tags from a string.

Instructions

Difficulty: Moderate

Things You’ll Need:

  • Server-Side Language

Step1
Get a string containing the HTML.
Step2
Construct a regular expression pattern that uses a negative lookahead to match unexpected tags. A negative lookahead is a group that starts with "(?!". Note that a lookahead only checks what comes next and does not consume any characters, so you'll need to flesh out the pattern so that it also matches the rest of the tag.
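To make the zero-width behaviour concrete, here is a minimal sketch in plain JavaScript (run under Node or a browser rather than Windows Script Host). The choice of b and i as the approved tags is an assumption made only for the demonstration. The first pattern contains just the lookahead, so it matches nothing but the opening bracket; the second adds [^>]*> so the whole unapproved tag is matched.

```javascript
// Approved tags here are only <b> and <i>; that choice is an assumption for the demo.

// The lookahead alone is zero-width: it inspects what follows "<" but consumes nothing.
// The optional slash sits inside the lookahead so that closing tags such as </b>
// cannot slip past it through backtracking.
const lookaheadOnly = /<(?!\/?(?:b|i)\b)/;
console.log(lookaheadOnly.exec("<div>")[0]); // "<"
console.log(lookaheadOnly.test("<b>"));      // false

// Fleshed out with [^>]*> so the match covers the entire unapproved tag.
const wholeTag = /<(?!\/?(?:b|i)\b)[^>]*>/;
console.log(wholeTag.exec("<div class='x'>")[0]); // "<div class='x'>"
console.log(wholeTag.test("</b>"));               // false
```

The \b after the tag names keeps a name such as body from being mistaken for the approved b.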
Step3
Make sure the pattern ignores case and matches globally.
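A short sketch of why both flags matter, again assuming b and i are the approved tags and using a made-up input string. Without "g" only the first unapproved tag is removed; without "i" an approved tag written in upper case is not recognised as approved and gets stripped as well.

```javascript
// Pattern source as a string so the same pattern can be compiled with different flags.
const source = "<(?!\\/?(?:b|i)\\b)[^>]*>";
const input = "<DIV><B>hi</B></DIV>";

const noFlags = input.replace(new RegExp(source), "");        // first match only
const globalOnly = input.replace(new RegExp(source, "g"), ""); // case-sensitive
const both = input.replace(new RegExp(source, "gi"), "");      // global, ignore case

console.log(noFlags);    // "<B>hi</B></DIV>"
console.log(globalOnly); // "hi"
console.log(both);       // "<B>hi</B>"
```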
Step4
Call replace on the string and pass in your regular expression.
Step5
That should be it. Please view the example I have attached as an image. It is written in JavaScript for Windows Script Host, but the logic applies to most major languages (C#, PHP, Python, Java, etc.).
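The five steps can be sketched end to end as follows. This is not the author's original example; the function name removeUnapprovedTags, the list of approved tags, and the sample input are all hypothetical, and the code targets Node or a browser rather than Windows Script Host.

```javascript
// Hypothetical helper; the name and signature are illustrative only.
// Assumes tag names are plain letters and digits (no regex metacharacters).
function removeUnapprovedTags(html, allowedTags) {
  // Step 2: the negative lookahead rejects approved names,
  // and [^>]*> consumes the rest of any other tag.
  const source = "<(?!\\/?(?:" + allowedTags.join("|") + ")\\b)[^>]*>";
  // Step 3: "g" replaces every match, "i" ignores case.
  const pattern = new RegExp(source, "gi");
  // Step 4: replace each unapproved tag with an empty string.
  return html.replace(pattern, "");
}

// Step 1: a string containing the HTML (made-up sample input).
const pasted = '<p>Hello <span style="color:red">world</span><SCRIPT>alert(1)</SCRIPT></p>';

console.log(removeUnapprovedTags(pasted, ["p", "b", "i", "br"]));
// "<p>Hello worldalert(1)</p>"
```

Note that only the tags are removed. The text between them, including the body of the script element, stays in the string, and attributes on approved tags are left untouched.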

Tips & Warnings

  • Isolated less-than signs (<), i.e. those that are not part of tags, will cause some nasty results, such as the pattern swallowing everything up to the next greater-than sign. If you expect those, run another filter first to convert them to entities (&lt;).
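One possible pre-filter, sketched in JavaScript. The heuristic is an assumption and not a full HTML parser: a less-than sign counts as the start of a tag only when a letter, a slash, or an exclamation mark follows it directly, so text such as a<b would still be read as a tag. The function name and sample input are hypothetical.

```javascript
// Heuristic (an assumption, not a full HTML parser): a "<" only starts a tag
// when a letter, "/" or "!" comes right after it. Any other "<" becomes "&lt;".
function escapeLoneLessThan(text) {
  return text.replace(/<(?![a-z\/!])/gi, "&lt;");
}

// Same style of tag filter as in the steps, approving only <b> and <i>.
const unapprovedTag = /<(?!\/?(?:b|i)\b)[^>]*>/gi;

const input = "1 < 2 <i>ok</i>";

// Without the pre-filter, the stray "<" is read as the start of a tag
// and everything up to the next ">" is removed with it.
const unprotected = input.replace(unapprovedTag, "");

// With the pre-filter, the stray "<" is already an entity when the tag filter runs.
const protectedResult = escapeLoneLessThan(input).replace(unapprovedTag, "");

console.log(unprotected);     // "1 ok</i>"
console.log(protectedResult); // "1 &lt; 2 <i>ok</i>"
```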
