Strip HTML tags from a string using regular expressions

My blog has moved.
You can view this post at the following address:
http://www.osherove.com/blog/2003/5/13/strip-html-tags-from-a-string-using-regular-expressions.html
Published Tuesday, May 13, 2003 9:41 AM by RoyOsherove

Comments

Monday, May 12, 2003 8:22 PM by Paschal

# re: Strip HTML tags from a string using regular expressions

Thanks for this, Roy, but what I want is to keep some tags, like <p>, <br>, etc.

Tuesday, May 13, 2003 2:08 AM by Oisin

# re: Strip HTML tags from a string using regular expressions

This is probably not the best regular expression for stripping HTML either. A regex engine that performs greedy matching (i.e. tries to match as many characters as possible) will match \1 in '<(.*)>' to 'i>important info</i' in '<i>important info</i>'.

I would suggest either using non-greedy matching via '<.*?>' (i.e. match up to the first '>' you find, not the last possible one) or using a more specific pattern like '<[^>]+>', i.e. match a '<', then match one or more sequential characters that are not '>', up to the first '>' you find.
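
The difference between the three patterns mentioned above can be seen in a small sketch. The class and method names here (GreedyDemo, StripGreedy, StripLazy, StripNegated) are made up for illustration and are not from the original post; only Regex.Replace is the real .NET API.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyDemo
{
    // Greedy: ".*" runs on to the LAST '>' on the line,
    // so the text between the tags is swallowed too.
    public static string StripGreedy(string html)
    {
        return Regex.Replace(html, "<.*>", "");
    }

    // Non-greedy: ".*?" stops at the FIRST '>' after each '<'.
    public static string StripLazy(string html)
    {
        return Regex.Replace(html, "<.*?>", "");
    }

    // Negated class: '<', then one or more characters that are not '>', then '>'.
    public static string StripNegated(string html)
    {
        return Regex.Replace(html, "<[^>]+>", "");
    }

    public static void Main()
    {
        string html = "<i>important info</i>";
        Console.WriteLine("[" + StripGreedy(html) + "]");   // prints "[]"
        Console.WriteLine("[" + StripLazy(html) + "]");     // prints "[important info]"
        Console.WriteLine("[" + StripNegated(html) + "]");  // prints "[important info]"
    }
}
```

Note that '.' does not match a newline by default in .NET, so the non-greedy version will miss tags that span several lines unless RegexOptions.Singleline is used; the negated-class version has no such problem.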

Regex is a dark, deep hole: once you fall in, it's hard to get out. But like any big hole, there's light at one end of it ;)

Tuesday, May 13, 2003 2:51 AM by Hugh Brown

# re: Strip HTML tags from a string using regular expressions

Your regex eliminates everything in this HTML:

string html = "<html><head><title>asasdasd</title></head><body><h1>qweqweqwe</h1><div>This is the content</div></body></html>";
string modified = StripHTML(html);
Console.WriteLine (modified);

I typically use patterns more like this to find html tags:

private static string linkPattern = @"(\<link[^\>]+\>)";

Lots of bath water left when I'm done with that baby.
Tuesday, May 13, 2003 3:19 AM by Roy Osherove

# re: Strip HTML tags from a string using regular expressions

Thanks for the great tips guys! I'll look into it and fix the samples. :)

Monday, December 29, 2003 1:00 AM by Joshua Olson

# re: Strip HTML tags from a string using regular expressions

The regexes at the following link may be useful in this case. They are more robust in terms of matching HTML tags than the simple pattern provided earlier.

http://concepts.waetech.com/unclosed_tags/
Thursday, March 04, 2004 9:04 PM by Jorge

# re: Strip HTML tags from a string using regular expressions

I'm not sure if this topic is still of interest, but I thought I'd share my findings and hopefully address Paschal's need to keep certain tags. I am using the following regular expression to selectively strip potentially malicious tags from HTML text.

string output = Regex.Replace(input, @"</?(?i:script|embed|object|frameset|frame|iframe|meta|link|style)(.|\n)*?>", "");

This expression doesn't suffer from some of the side effects of using the first expression (such as changing "5 < 8 and 3 > 1" to "5 1") and provides the added benefit of being case insensitive (?i:). And you can easily add/remove tag names that you want/don't want to strip.
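
The opposite approach, keeping a short list of tags and stripping everything else (which is closer to what Paschal asked for), might be sketched like this. It is only a sketch under stated assumptions: the allowed tags (p and br) are hard-coded, the names AllowListDemo and StripAllExceptPAndBr are invented for the example, and like the expression above it removes only the tags themselves, not the text between them.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AllowListDemo
{
    // "</?"            : a '<', optionally followed by '/'
    // "(?!(?:p|br)\b)" : negative lookahead, skip the tag if its name is exactly p or br
    // "[a-z][^>]*>"    : a tag name must start with a letter, then anything up to '>'
    private static readonly Regex NotAllowed =
        new Regex(@"</?(?!(?:p|br)\b)[a-z][^>]*>", RegexOptions.IgnoreCase);

    public static string StripAllExceptPAndBr(string html)
    {
        return NotAllowed.Replace(html, "");
    }

    public static void Main()
    {
        string html = "<div><p>Hello <b>world</b><br/></p><script>x</script></div>";
        // prints "<p>Hello world<br/></p>x"
        Console.WriteLine(StripAllExceptPAndBr(html));
        // prints "5 < 8 and 3 > 1" (no letter follows '<', so nothing matches)
        Console.WriteLine(StripAllExceptPAndBr("5 < 8 and 3 > 1"));
    }
}
```

The \b after the tag names is what stops <pre> from being mistaken for <p>. Attributes on the kept tags are left untouched, so this is not enough on its own if the goal is to sanitize untrusted input.
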
Thursday, April 08, 2004 5:36 PM by Robert Andersson

# re: Strip HTML tags from a string using regular expressions

None of the techniques offered will cope with perfectly valid (X)HTML such as:
<div title=">">...</div>

Just pointing it out, no time to craft a good regex now :)

Thursday, May 20, 2004 11:13 AM by Dan McCleary

# re: Strip HTML tags from a string using regular expressions

That's a good point, and definitely a risk - however, even in xhtml, and especially in xml, this should be written as <div title="&gt;">...</div>
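
For markup that does contain a raw '>' inside a quoted attribute, one possible pattern treats quoted attribute values as units. This is a rough sketch, not a full HTML parser; the names QuotedAttributeDemo, StripSimple and StripQuoteAware are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuotedAttributeDemo
{
    // The simple pattern discussed in this thread: stops at the first '>'.
    public static string StripSimple(string html)
    {
        return Regex.Replace(html, "<[^>]+>", "");
    }

    // Quote-aware: inside a tag, consume either a whole "..." value,
    // a whole '...' value, or any single character that is not a quote or '>'.
    public static string StripQuoteAware(string html)
    {
        return Regex.Replace(html, @"</?[a-zA-Z](?:""[^""]*""|'[^']*'|[^'"">])*>", "");
    }

    public static void Main()
    {
        string html = "<div title=\">\">...</div>";
        Console.WriteLine(StripSimple(html));      // prints "">..." (leftover debris)
        Console.WriteLine(StripQuoteAware(html));  // prints "..."
    }
}
```

Comments, CDATA sections and unbalanced quotes are still not handled, which is where a real HTML parser becomes the safer choice.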

Monday, June 28, 2004 6:16 AM by xx

# re: Strip HTML tags from a string using regular expressions

<a>xx</a>
Friday, July 02, 2004 11:02 AM by Peter carwell

# re: Strip HTML tags from a string using regular expressions

You should point out that you need to import the System.Text.RegularExpressions namespace (a using directive in C#) in your application for this to work.

Tuesday, July 06, 2004 6:56 AM by Martin Nikolaev

# re: Strip HTML tags from a string using regular expressions

For the regular TAG MATCHING - why don't you just try

(<[^>]+>)


It will match all tags (also the nested ones) together with any attributes the tags contain!!!

cu
Wednesday, December 08, 2004 2:48 PM by TrackBack

# Just one Line of Code

I keep forgetting about this one line of code that makes Regex one of the best creations in the history of computing.
Thursday, March 31, 2005 5:21 PM by TrackBack

# Parsing: Beyond Regex

I've blogged ad nauseam about how much I love Regular Expressions, but even the mighty regular expression has limits. As noted in Daniel Cazzulini's blog: A full-blown programming language cannot be parsed with regular expressions. But given the limited...
Monday, December 18, 2006 2:41 AM by walid

# re: Strip HTML tags from a string using regular expressions

What about " &nbsp; " ??
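
Entities such as &nbsp; are not tags, so none of the tag-stripping patterns in this thread touch them. One way to deal with them, assuming .NET 4.0 or later where System.Net.WebUtility is available, is to decode entities after the tags have been removed. The names EntityDemo and StripAndDecode are hypothetical, and replacing the non-breaking space with a plain space is a choice made for this sketch.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class EntityDemo
{
    public static string StripAndDecode(string html)
    {
        // 1. Remove the tags first, so that escaped markup such as
        //    "&lt;b&gt;" is not turned into a real tag and then stripped.
        string noTags = Regex.Replace(html, "<[^>]+>", "");

        // 2. Decode character entities (&amp;, &lt;, &nbsp;, ...).
        string decoded = WebUtility.HtmlDecode(noTags);

        // 3. &nbsp; decodes to U+00A0; swap it for an ordinary space.
        return decoded.Replace('\u00A0', ' ');
    }

    public static void Main()
    {
        Console.WriteLine(StripAndDecode("<p>a&nbsp;b &amp; c</p>"));  // prints "a b & c"
    }
}
```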
