Stripping out HTML tags


This blog is a great place to find solutions, and I enjoy being part of this community.

Here is another idea for removing HTML tags, from Christian Dehaeseleer:

" If your HTML is not "simple" then RegEx will not work... (what if your HTML contains "<" etc.).

Another option is to parse your HTML as XML with SgmlReader (available on GotDotNet) and then treat the XML as you wish (for instance using a default XSL Template will remove all tags...)
"


Hey folks, what do you think about this versus using Regex?
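
To make the comparison concrete, here is a rough sketch of both routes side by side. The class and method names (TagStripDemo, WithRegex, WithXslt) are made up for illustration and are not from Christian's note. The XSLT half assumes its input is already well-formed XML, which is the step SgmlReader would handle in his suggestion; that conversion is not shown here. It relies on XSLT's built-in template rules: a stylesheet with no templates of its own outputs only the text nodes.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;
using System.Xml;
using System.Xml.Xsl;

// Hypothetical helper class, for illustration only.
public static class TagStripDemo
{
    // Regex route: delete anything that looks like <...>.
    // Fine for simple markup, but a literal '<' in the text confuses it.
    public static string WithRegex(string html)
    {
        return Regex.Replace(html, "<[^>]*>", string.Empty);
    }

    // A stylesheet with no templates: the built-in XSLT rules walk the
    // tree and copy only text nodes to the output.
    private const string EmptyStylesheet =
        "<xsl:stylesheet version=\"1.0\" " +
        "xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">" +
        "<xsl:output method=\"text\"/>" +
        "</xsl:stylesheet>";

    // XML route: assumes 'xml' is well-formed (e.g. already converted
    // from HTML by a separate SGML-to-XML step).
    public static string WithXslt(string xml)
    {
        var transform = new XslCompiledTransform();
        using (var styleReader = XmlReader.Create(new StringReader(EmptyStylesheet)))
        {
            transform.Load(styleReader);
        }

        using (var input = XmlReader.Create(new StringReader(xml)))
        using (var output = new StringWriter())
        {
            transform.Transform(input, null, output);
            return output.ToString();
        }
    }
}
```

On simple markup such as `<p>Hello <b>world</b></p>` both methods give `Hello world`. The difference shows up with a stray angle bracket: WithRegex treats everything from a literal `<` up to the next `>` as a tag and drops it, while WithXslt, given the escaped form `<p>a &lt; b</p>`, returns `a < b` intact.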

1 Comment

  • Here is my quick and dirty solution. A regular expression would be more elegant, though; I guess I need to buy a RegExp book :)))





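    // Needs "using System;" and "using System.Text;" (for Char and StringBuilder).
    public static string StripHTMLTags(string html)
    {
        // Split on '<': every fragment after the first starts inside a tag.
        string[] open_fragments = html.Split(new Char[] {'<'});
        StringBuilder sb = new StringBuilder();

        foreach (string fragment in open_fragments)
        {
            int loc = fragment.IndexOf('>');

            // The very last char is the closing '>' (or the fragment is empty),
            // so there is no text to keep.
            if (fragment.Length - 1 == loc)
                continue;

            // Keep only the text after the tag's closing '>'.
            // (>= 0 rather than > 0, so a fragment starting with '>' is handled too.)
            if (loc >= 0)
                sb.Append(fragment.Substring(loc + 1));
            else
                sb.Append(fragment);
        }

        return sb.ToString();
    }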

