Strip HTML tags from a string using regular expressions - ISerializable - Roy Osherove's Blog

Strip HTML tags from a string using regular expressions

Paschal asked me to find a simple solution for stripping HTML tags from a given string using regular expressions.

The solution is quite simple:

1. Retrieve all the HTML tags using this pattern: <(.|\n)*?>

2. Replace them with an empty string and return the result

Here's a C# function that does this:

private string StripHTML(string htmlString)
{
    // Requires: using System.Text.RegularExpressions;
    // This pattern matches each HTML tag, i.e. everything from '<' to '>':
    // (.|\n) -> any character or a newline
    // *?     -> zero or more occurrences, non-greedy, meaning that
    //           the match stops at the first available '>' it sees, not at the last one
    //           (if it stopped at the last one, we could overlook
    //           nested HTML tags inside a bigger HTML tag)
    // Thanks to Oisin and Hugh Brown for helping on this one...
    string pattern = @"<(.|\n)*?>";

    return Regex.Replace(htmlString, pattern, string.Empty);
}

Or with just one line of code:

string stripped = Regex.Replace(textBox1.Text,@"<(.|\n)*?>",string.Empty);
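For anyone who wants to try the function outside a form, here is a minimal console sketch. The class name StripHtmlDemo and the sample input are made up for illustration; only the pattern comes from the post.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StripHtmlDemo
{
    // Same pattern as in the post: '<', then as few characters as possible
    // (newlines included), then the first '>'.
    public static string StripHtml(string html)
    {
        return Regex.Replace(html, @"<(.|\n)*?>", string.Empty);
    }

    public static void Main()
    {
        // Hypothetical sample input; the <a> tag spans two lines on purpose.
        string html = "<p>Hello, <b>world</b>!</p><a\nhref=\"#\">link</a>";
        Console.WriteLine(StripHtml(html)); // prints "Hello, world!link"
    }
}
```

Only the tags are removed. The text between them is kept and joined with no separator, so adjacent blocks run together.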

 

Published Tuesday, May 13, 2003 12:41 PM by RoyOsherove

Comments

Monday, May 12, 2003 8:22 PM by Paschal

# re: Strip HTML tags from a string using regular expressions

Thanks Roy for this but what I want is to keep some tags like <p><br>, etc...

Tuesday, May 13, 2003 2:08 AM by Oisin

# re: Strip HTML tags from a string using regular expressions

This is probably not the best regular expression for stripping HTML either. A regex engine that performs greedy matching (i.e. tries to match as many characters as possible) will match \1 in '<(.*)>' to 'i>important info</i' in '<i>important info</i>'.

I would suggest either using non-greedy matching via '<.*?>' (i.e. stop at the first '>' you find, not the last possible one) or using a more specific pattern like '<[^>]+>', i.e. match a '<', then match one or more sequential characters that are not '>', up until the first '>' you find.

Regex is a dark and deep hole that once you fall in, it's hard to get out; but like a big hole, there's light at one end of it ;)
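The difference between the three patterns mentioned in this comment can be checked with a small sketch. The class name GreedyVsLazyDemo and the second sample string are assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazyDemo
{
    public static string Strip(string input, string pattern)
    {
        return Regex.Replace(input, pattern, string.Empty);
    }

    public static void Main()
    {
        string html = "<i>important info</i>";
        // Greedy: '.*' runs to the last '>', so the whole string is one match.
        Console.WriteLine("[" + Strip(html, "<.*>") + "]");    // prints "[]"
        // Lazy: '.*?' stops at the first '>'.
        Console.WriteLine("[" + Strip(html, "<.*?>") + "]");   // prints "[important info]"
        // Negated class: never crosses a '>' in the first place.
        Console.WriteLine("[" + Strip(html, "<[^>]+>") + "]"); // prints "[important info]"

        // Without RegexOptions.Singleline, '.' does not match '\n', so the lazy
        // pattern misses a tag that spans two lines; '[^>]' does match '\n'.
        string multiLine = "<a\nhref=\"#\">x</a>";
        Console.WriteLine(Strip(multiLine, "<.*?>") == "<a\nhref=\"#\">x"); // prints "True"
        Console.WriteLine(Strip(multiLine, "<[^>]+>") == "x");              // prints "True"
    }
}
```

The last two lines show why the post uses (.|\n) rather than a bare dot. The negated class gets the same effect without the alternation.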

Tuesday, May 13, 2003 2:51 AM by Hugh Brown

# re: Strip HTML tags from a string using regular expressions

Your regex eliminates everything in this HTML:

string html = "<html><head><title>asasdasd</title></head><body><h1>qweqweqwe</h1><div>This is the content</div></body></html>";
string modified = StripHTML(html);
Console.WriteLine (modified);

I typically use patterns more like this to find html tags:

private static string linkPattern = @"(\<link[^\>]+\>)";

Lots of bath water left when I'm done with that baby.

Tuesday, May 13, 2003 3:19 AM by Roy Osherove

# re: Strip HTML tags from a string using regular expressions

Thanks for the great tips guys! I'll look into it and fix the samples. :)

Monday, December 29, 2003 1:00 AM by Joshua Olson

# re: Strip HTML tags from a string using regular expressions

The regexes at the following link may be useful in this case. They are more robust in terms of matching HTML tags than the simple pattern provided earlier.

http://concepts.waetech.com/unclosed_tags/

Thursday, March 04, 2004 9:04 PM by Jorge

# re: Strip HTML tags from a string using regular expressions

I'm not sure if this topic is still of interest, but I thought I'd share my findings and hopefully address Paschal's need to keep certain tags. I am using the following regular expression to selectively strip potentially malicious tags from HTML text.

string output = Regex.Replace(input, @"</?(?i:script|embed|object|frameset|frame|iframe|meta|link|style)(.|\n)*?>", "");

This expression doesn't suffer from some of the side effects of using the first expression (such as changing "5 < 8 and 3 > 1" to "5 1") and provides the added benefit of being case insensitive (?i:). And you can easily add/remove tag names that you want/don't want to strip.
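A short sketch of how the expression above behaves, together with one possible way to keep only certain tags. The blocklist pattern is quoted from this comment. The allowlist pattern, the class name SelectiveStripDemo and the sample strings are assumptions added for illustration, not something proposed in the thread.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SelectiveStripDemo
{
    // Blocklist pattern quoted from the comment above.
    private const string Blocklist =
        @"</?(?i:script|embed|object|frameset|frame|iframe|meta|link|style)(.|\n)*?>";

    // Hypothetical allowlist variant: remove every tag except <p> and <br>.
    // The negative lookahead rejects the kept names; requiring a letter after
    // '<' or '</' leaves text such as "5 < 8" alone.
    private const string AllButPAndBr = @"</?(?!(?:p|br)\b)[a-z][^>]*>";

    public static string StripBlocked(string input)
    {
        return Regex.Replace(input, Blocklist, string.Empty);
    }

    public static string KeepPAndBr(string input)
    {
        return Regex.Replace(input, AllButPAndBr, string.Empty, RegexOptions.IgnoreCase);
    }

    public static void Main()
    {
        Console.WriteLine(StripBlocked("5 < 8 and 3 > 1"));
        // prints "5 < 8 and 3 > 1"
        Console.WriteLine(StripBlocked("<p>Hi<SCRIPT>alert(1)</SCRIPT></p>"));
        // prints "<p>Hialert(1)</p>"
        Console.WriteLine(KeepPAndBr("<div><p>Hello<br>world</p></div>"));
        // prints "<p>Hello<br>world</p>"
    }
}
```

Note that the blocklist removes the script tags but leaves the script text between them, and that it has no word boundary after the tag name, so a tag whose name merely starts with one of the listed names would also match. Neither pattern should be treated as a complete defence against malicious markup.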
Thursday, April 08, 2004 5:36 PM by Robert Andersson

# re: Strip HTML tags from a string using regular expressions

None of the techniques offered so far copes with perfectly valid (X)HTML such as:
<div title=">">...</div>

Just pointing it out, no time crafting a good regex now :)

Thursday, May 20, 2004 11:13 AM by Dan McCleary

# re: Strip HTML tags from a string using regular expressions

That's a good point, and definitely a risk - however, even in xhtml, and especially in xml, this should be written as <div title="&gt;">...</div>
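For markup that does contain a raw '>' inside a quoted attribute value, one possible approach is to let the pattern consume whole quoted strings while it is inside a tag. The pattern and the class name QuoteAwareDemo below are an illustrative sketch, not something proposed in the thread, and they still do not make a regex a real HTML parser.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuoteAwareDemo
{
    // Pattern from the post, for comparison.
    private const string SimpleTag = @"<(.|\n)*?>";

    // Hypothetical quote-aware pattern. Inside a tag it accepts, in order:
    // a double-quoted value, a single-quoted value, or any single character
    // that is not a quote or '>'. A '>' inside quotes no longer ends the tag.
    private const string QuoteAwareTag = @"<(?:""[^""]*""|'[^']*'|[^'"">])*>";

    public static string StripSimple(string html)
    {
        return Regex.Replace(html, SimpleTag, string.Empty);
    }

    public static string StripQuoteAware(string html)
    {
        return Regex.Replace(html, QuoteAwareTag, string.Empty);
    }

    public static void Main()
    {
        string html = "<div title=\">\">text</div>";
        Console.WriteLine(StripSimple(html));     // prints "">text" (leftover quote and '>')
        Console.WriteLine(StripQuoteAware(html)); // prints "text"
    }
}
```

A tag with an unbalanced quote does not match the quote-aware pattern at all, so such input is left in place rather than partially removed.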

Monday, June 28, 2004 6:16 AM by xx

# re: Strip HTML tags from a string using regular expressions

<a>xx</a>

Friday, July 02, 2004 11:02 AM by Peter carwell

# re: Strip HTML tags from a string using regular expressions

You should point out that you need a reference to the System.Text.RegularExpressions namespace (for example, a using directive) in your application for this to work.

Tuesday, July 06, 2004 6:56 AM by Martin Nikolaev

# re: Strip HTML tags from a string using regular expressions

For the regular TAG MATCHING - why don't you just try

(<[^>]+>)


It will match all tags (also the nested ones) together with any attributes the tags contain!!!

cu

Wednesday, December 08, 2004 2:48 PM by TrackBack

# Just one Line of Code

I keep forgetting about this one line of code that makes Regex one of the best creations in the history of computing.

Thursday, March 31, 2005 5:21 PM by TrackBack

# Parsing: Beyond Regex

I've blogged ad nauseam about how much I love Regular Expressions, but even the mighty regular expression has limits. As noted in Daniel Cazzulini's blog: A full-blown programming language cannot be parsed with regular expressions. But given the limited...

Monday, December 18, 2006 2:41 AM by walid

# re: Strip HTML tags from a string using regular expressions

What about " &nbsp; " ??
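Entities such as &nbsp; are not tags, so the pattern leaves them untouched. One way to handle them, sketched here as an assumption rather than as part of the original solution, is to decode entities after stripping the tags with System.Net.WebUtility.HtmlDecode (available in .NET Framework 4 and later). The class name EntityDemo and the sample string are made up for illustration.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class EntityDemo
{
    public static string StripAndDecode(string html)
    {
        // Step 1: remove the tags with the pattern from the post.
        string withoutTags = Regex.Replace(html, @"<(.|\n)*?>", string.Empty);
        // Step 2: decode entities such as &nbsp; and &amp; into characters.
        string decoded = WebUtility.HtmlDecode(withoutTags);
        // &nbsp; decodes to U+00A0 (no-break space); map it to a plain space.
        return decoded.Replace('\u00A0', ' ');
    }

    public static void Main()
    {
        Console.WriteLine(StripAndDecode("<p>Fish&nbsp;&amp;&nbsp;chips</p>"));
        // prints "Fish & chips"
    }
}
```

Decoding after stripping means that escaped text such as &lt;b&gt; survives as the literal characters "<b>" instead of being removed as a tag.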

Tuesday, August 12, 2008 8:19 PM by Twitter Mirror

# RoyOsherove : weird - my top viewed page this month: http://weblogs.asp.net/rosherove/archive/2003/05/13/6963.aspx
