Stripping out HTML tags

Monday, April 14, 2003

.NET

This blog is a great place to find solutions. I enjoy being part of this community.

Here is another idea for removing HTML tags, from Christian Dehaeseleer:

"If your HTML is not "simple" then RegEx will not work... (what if your HTML contains "<", etc.?)

Another option is to parse your HTML as XML with SgmlReader (available on GotDotNet) and then treat the XML as you wish (for instance, using a default XSL template will remove all tags...)"
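
To make the XML route concrete, here is a minimal sketch. It assumes the HTML has already been turned into well-formed XML (the step SgmlReader would perform, which is not shown here), and it swaps in XmlDocument.InnerText for the default XSL template, since both end up keeping the text nodes and discarding the markup. The class and method names are made up for illustration.

```csharp
using System;
using System.Xml;

public static class XmlTagStripper
{
    // Assumption: 'xhtml' is already well-formed XML with a single root element,
    // e.g. the output of an HTML-to-XML conversion step that is not shown here.
    public static string StripTags(string xhtml)
    {
        XmlDocument doc = new XmlDocument();
        doc.LoadXml(xhtml);
        // InnerText concatenates the text nodes and drops the element markup.
        return doc.DocumentElement.InnerText;
    }

    public static void Main()
    {
        // Entities such as &lt; are decoded by the XML parser, not mistaken for tags.
        Console.WriteLine(StripTags("<p>Hello <b>world</b> &lt; tags</p>"));
    }
}
```

Because a real XML parser does the work, an escaped "<" in the text or a ">" inside an attribute value does not confuse it. The price is that the input must be well-formed first.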

Hey folks, what do you think about this versus using Regex?
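
For comparison, a rough sketch of the Regex route and of the kind of input that trips it up. The pattern below is just one common naive choice, not something taken from the quote above, and the class and method names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexTagStripper
{
    // Naive pattern: a '<', then any run of characters other than '>', then '>'.
    public static string StripTags(string html)
    {
        return Regex.Replace(html, "<[^>]*>", "");
    }

    public static void Main()
    {
        // Simple markup is handled as expected.
        Console.WriteLine(StripTags("<p>Hello <b>world</b></p>"));
        // A '>' inside an attribute value ends the match too early,
        // so part of the tag leaks into the output.
        Console.WriteLine(StripTags("<a title=\"x > y\">link</a>"));
    }
}
```

The first call yields "Hello world", while the second leaves the tail of the attribute in the result, which is the "not simple" case the quote warns about.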

1 Comment

Here is my quick and dirty solution, although doing it with regular expressions would be more elegant. I guess I need to buy a RegExp book :)))

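// requires: using System.Text;
public static string StripHTMLTags(string html)
{
    // After splitting on '<', every fragment except the first starts inside a tag.
    string[] open_fragments = html.Split(new char[] {'<'});
    StringBuilder sb = new StringBuilder();

    // The first fragment comes before any '<', so it is plain text.
    // Keep it whole, even if it happens to contain a '>'.
    sb.Append(open_fragments[0]);

    for (int i = 1; i < open_fragments.Length; i++)
    {
        string fragment = open_fragments[i];
        int loc = fragment.IndexOf('>');

        if (loc >= 0)
            // Drop everything up to and including '>'; keep the text after it.
            sb.Append(fragment.Substring(loc + 1));
        else
            // No '>' in this fragment: keep it as text (the unmatched '<' is dropped).
            sb.Append(fragment);
    }

    return sb.ToString();
}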

gyurisc - Monday, April 14, 2003 8:45:00 AM
