Strip HTML tags from a string using regular expressions

My blog has moved.
You can view this post at the following address:
http://www.osherove.com/blog/2003/5/13/strip-html-tags-from-a-string-using-regular-expressions.html
Published Tuesday, May 13, 2003 9:41 AM by RoyOsherove

Comments

Tuesday, May 13, 2003 2:08 AM by Oisin

# re: Strip HTML tags from a string using regular expressions

This is probably not the best regular expression for stripping HTML either. A regex engine that performs greedy matching (i.e. tries to match as many characters as possible) will match \1 in '<(.*)>' to 'i>important info</i' in '<i>important info</i>'.

I would suggest either using non-greedy matching via '<.*?>' (i.e. match up to the first '>' you find, not the last possible one) or using a more specific pattern like '<[^>]+>', i.e. match a '<', then match one or more characters that are not '>', up to the first '>' you find.
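The difference between these patterns is easy to check. Here is a minimal sketch, written in Python rather than the C# used elsewhere in this thread (the pattern syntax is the same for these cases). The sample string is the one from above; the variable names are only for illustration.

```python
import re

sample = "<i>important info</i>"

# Greedy: .* runs to the LAST '>' on the line, so the whole string is one match.
greedy_group = re.search(r"<(.*)>", sample).group(1)

# Non-greedy: .*? stops at the FIRST '>' it can, so each tag matches separately.
lazy_tags = re.findall(r"<.*?>", sample)

# Negated character class: same result here, without relying on laziness.
negated_tags = re.findall(r"<[^>]+>", sample)

# Stripping the tags with each pattern.
stripped_greedy = re.sub(r"<.*>", "", sample)
stripped_lazy = re.sub(r"<.*?>", "", sample)
stripped_negated = re.sub(r"<[^>]+>", "", sample)

print(repr(greedy_group))      # 'i>important info</i'
print(lazy_tags)               # ['<i>', '</i>']
print(negated_tags)            # ['<i>', '</i>']
print(repr(stripped_greedy))   # ''
print(repr(stripped_lazy))     # 'important info'
print(repr(stripped_negated))  # 'important info'
```

The greedy version throws away the text along with the tags, which is exactly the problem described above.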

Regex is a deep, dark hole: once you fall in, it's hard to get out. But like any big hole, there's light at one end of it ;)

Thursday, March 4, 2004 9:04 PM by Jorge

# re: Strip HTML tags from a string using regular expressions

I'm not sure if this topic is still of interest, but I thought I'd share my findings and hopefully address Paschal's need to keep certain tags. I am using the following regular expression to selectively strip potentially malicious tags from HTML text.

string output = Regex.Replace(input, @"</?(?i:script|embed|object|frameset|frame|iframe|meta|link|style)(.|\n)*?>", "");

This expression doesn't suffer from some of the side effects of using the first expression (such as changing "5 < 8 and 3 > 1" to "5 1") and provides the added benefit of being case insensitive (?i:). And you can easily add/remove tag names that you want/don't want to strip.
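For anyone who wants to try this outside .NET, here is a rough sketch in Python with the pattern carried over unchanged. The "strip every tag" pattern used for comparison is an assumption (the expression from the original post is not quoted in this thread, so '<[^>]+>' stands in for it), and the sample inputs are made up for illustration.

```python
import re

# The selective pattern from above, unchanged. (?i:...) makes only the tag
# names case-insensitive; Python needs 3.6 or later for this syntax.
DANGEROUS_TAGS = re.compile(
    r"</?(?i:script|embed|object|frameset|frame|iframe|meta|link|style)(.|\n)*?>"
)

# Assumed stand-in for a pattern that strips every tag.
ANY_TAG = re.compile(r"<[^>]+>")

text = "5 < 8 and 3 > 1"
markup = '<b>bold</b><SCRIPT type="text/javascript">alert(1)</SCRIPT>'

plain_after_any = ANY_TAG.sub("", text)               # the comparison gets eaten
plain_after_selective = DANGEROUS_TAGS.sub("", text)  # left alone
markup_after_selective = DANGEROUS_TAGS.sub("", markup)
prefix_case = DANGEROUS_TAGS.sub("", "<linker>x</linker>")

print(repr(plain_after_any))        # '5  1'
print(repr(plain_after_selective))  # '5 < 8 and 3 > 1'
print(markup_after_selective)       # <b>bold</b>alert(1)
print(prefix_case)                  # x
```

Two things to keep in mind: only the tags themselves are removed, so the text between <SCRIPT> and </SCRIPT> stays behind, and there is no word boundary after the tag name, so a hypothetical tag such as <linker> is stripped as well.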
Thursday, April 8, 2004 5:36 PM by Robert Andersson

# re: Strip HTML tags from a string using regular expressions

None of the techniques offered so far will handle perfectly valid (X)HTML such as:
<div title=">">...</div>

Just pointing it out; no time to craft a good regex right now :)

Thursday, May 20, 2004 11:13 AM by Dan McCleary

# re: Strip HTML tags from a string using regular expressions

That's a good point, and definitely a risk. However, even in XHTML, and especially in XML, this is better written as <div title="&gt;">...</div>
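Both cases can be tried side by side. This is a small sketch in Python (not the C# used earlier in the thread), using the '<[^>]+>' pattern suggested above and the two div examples from these comments. Keep in mind that a raw '>' inside a quoted attribute value is permitted by both HTML and XML, so input cannot be assumed to arrive in the escaped form.

```python
import re

TAG = re.compile(r"<[^>]+>")

raw = '<div title=">">...</div>'         # literal '>' inside the attribute
escaped = '<div title="&gt;">...</div>'  # same attribute, entity-escaped

raw_result = TAG.sub("", raw)
escaped_result = TAG.sub("", escaped)

print(repr(raw_result))      # '">...'
print(repr(escaped_result))  # '...'
```

With the raw attribute, the match ends at the '>' inside the quotes and the tail of the tag leaks into the output. Handling that reliably generally calls for a real HTML parser rather than a single regex.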

Monday, June 28, 2004 6:16 AM by xx

# re: Strip HTML tags from a string using regular expressions

<a>xx</a>
Friday, July 2, 2004 11:02 AM by Peter carwell

# re: Strip HTML tags from a string using regular expressions

You should point out that you need to import the System.Text.RegularExpressions namespace (a using directive in C#) in your application for this to work.

Tuesday, July 6, 2004 6:56 AM by Martin Nikolaev

# re: Strip HTML tags from a string using regular expressions

For the regular TAG MATCHING - why don't you just try

(<[^>]+>)


It will match all tags (also the nested ones) together with any attributes the tags contain!!!

cu
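To see what this pattern picks up on nested markup, here is a quick sketch in Python (the pattern itself is unchanged from the comment above; the sample markup and variable names are invented for illustration).

```python
import re

TAG = re.compile(r"(<[^>]+>)")

markup = '<p class="intro">Some <b>bold <i>and italic</i></b> text</p>'

tags = TAG.findall(markup)      # every opening and closing tag, in order
text_only = TAG.sub("", markup)

print(tags)
# ['<p class="intro">', '<b>', '<i>', '</i>', '</b>', '</p>']
print(text_only)  # Some bold and italic text
```

Nested tags are found because each tag is matched on its own; the pattern knows nothing about nesting. It still trips over a '>' inside an attribute value, as pointed out earlier in the thread.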
Monday, December 18, 2006 2:41 AM by walid

# re: Strip HTML tags from a string using regular expressions

What about " &nbsp; " ??
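One possible answer, sketched in Python rather than C#: entities such as &nbsp; are not tags, so a tag-stripping regex leaves them alone, and they need a separate decoding step. The sketch below assumes the '<[^>]+>' pattern from earlier in the thread and uses Python's standard html.unescape; the sample string is made up.

```python
import html
import re

TAG = re.compile(r"<[^>]+>")

source = "<p>Fish&nbsp;&amp;&nbsp;chips</p>"

without_tags = TAG.sub("", source)            # entities are still encoded here
decoded = html.unescape(without_tags)         # &nbsp; -> U+00A0, &amp; -> &
normalized = decoded.replace("\u00a0", " ")   # optional: use plain spaces

print(repr(without_tags))  # 'Fish&nbsp;&amp;&nbsp;chips'
print(repr(decoded))       # 'Fish\xa0&\xa0chips'
print(normalized)          # Fish & chips
```

Decoding after stripping matters: if the entities were decoded first, text such as &lt;b&gt; would turn into something that looks like a tag and be removed by the regex.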