8 Comments

  • This is probably not the best regular expression for stripping HTML either. A regex engine that performs greedy matching (i.e. tries to match as many characters as possible) will match \1 in '<(.*)>' to 'i>important info</i' in '<i>important info</i>'.





    I would suggest either using non-greedy matching via '<.*?>' (i.e. match up to the first '>' you find, not the last possible one) or using a more specific pattern like '<[^>]+>' -- i.e. match a '<', then match one or more sequential characters that are not '>', up until the first '>' you find.





    Regex is a dark and deep hole that once you fall in, it's hard to get out; but like a big hole, there's light at one end of it ;)
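
The difference between the three patterns discussed in this comment can be checked with a small sketch. The class and method names (GreedyDemo, FirstCapture) are made up for illustration and are not from the original article; capture groups were added to the second and third patterns so the matched text can be printed.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyDemo
{
    // Hypothetical helper: returns the text captured by group 1 of the first match.
    public static string FirstCapture(string input, string pattern)
        => Regex.Match(input, pattern).Groups[1].Value;

    public static void Main()
    {
        const string html = "<i>important info</i>";

        // Greedy: .* runs to the last '>' in the input.
        Console.WriteLine(FirstCapture(html, "<(.*)>"));     // i>important info</i

        // Non-greedy: .*? stops at the first '>'.
        Console.WriteLine(FirstCapture(html, "<(.*?)>"));    // i

        // Negated character class: cannot cross a '>' at all.
        Console.WriteLine(FirstCapture(html, "<([^>]+)>"));  // i

        // Stripping with the more specific pattern.
        Console.WriteLine(Regex.Replace(html, "<[^>]+>", "")); // important info
    }
}
```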





  • I'm not sure if this topic is still of interest, but I thought I'd share my findings and hopefully address Paschal's need to keep certain tags. I am using the following regular expression to selectively strip potentially malicious tags from HTML text.



    string output = Regex.Replace(input, @"</?(?i:script|embed|object|frameset|frame|iframe|meta|link|style)(.|\n)*?>", "");



    This expression doesn't suffer from some of the side effects of using the first expression (such as changing "5 < 8 and 3 > 1" to "5 1") and provides the added benefit of being case insensitive (?i:). And you can easily add/remove tag names that you want/don't want to strip.
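
A minimal sketch of how the expression above behaves on a few inputs. The wrapper names (SelectiveStrip, StripDangerousTags) are assumptions for illustration; only the pattern itself comes from the comment.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SelectiveStrip
{
    // Pattern copied from the comment above.
    private const string Dangerous =
        @"</?(?i:script|embed|object|frameset|frame|iframe|meta|link|style)(.|\n)*?>";

    // Hypothetical wrapper around Regex.Replace.
    public static string StripDangerousTags(string input)
        => Regex.Replace(input, Dangerous, "");

    public static void Main()
    {
        // Listed tags are removed regardless of case; other tags are kept.
        Console.WriteLine(StripDangerousTags("<b>bold</b><SCRIPT src=\"x.js\"></SCRIPT>")); // <b>bold</b>

        // Plain comparisons are left alone because no listed tag name follows '<'.
        Console.WriteLine(StripDangerousTags("5 < 8 and 3 > 1")); // 5 < 8 and 3 > 1

        // Only the tags go; the text between them stays.
        Console.WriteLine(StripDangerousTags("<script>alert(1)</script>")); // alert(1)
    }
}
```

As the last call shows, the expression removes the tags themselves but not what sits between them, so script or style bodies would likely need separate handling if they must disappear too.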

  • None of the techniques offered will handle perfectly valid (X)HTML such as:

    <div title=">">...</div>



    Just pointing it out, no time to craft a good regex right now :)
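
One possible way to cope with a '>' inside an attribute value is to let the pattern skip over quoted strings. The sketch below is an assumption, not something proposed in the thread: the quote-aware pattern and the names QuoteAwareStrip, Naive, and QuoteAware are illustrative, and the pattern is not a full HTML parser.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuoteAwareStrip
{
    // Simple pattern from the earlier comments: stops at the first '>'.
    public const string Naive = "<[^>]+>";

    // Illustrative pattern: inside a tag, consume either non-quote,
    // non-'>' characters or a complete quoted attribute value.
    public const string QuoteAware =
        "<[^>\"']*(?:(?:\"[^\"]*\"|'[^']*')[^>\"']*)*>";

    public static string Strip(string html, string pattern)
        => Regex.Replace(html, pattern, "");

    public static void Main()
    {
        const string html = "<div title=\">\">text</div>";

        // The naive pattern ends the tag at the '>' inside the attribute.
        Console.WriteLine(Strip(html, Naive));      // ">text

        // The quote-aware pattern treats ">" as part of the attribute value.
        Console.WriteLine(Strip(html, QuoteAware)); // text
    }
}
```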

  • That's a good point, and definitely a risk - however, even in XHTML, and especially in XML, this should be written as <div title="&gt;">...</div>



  • <a>xx</a>

  • You should point out that you need a "using System.Text.RegularExpressions;" directive in your code (or the fully qualified System.Text.RegularExpressions.Regex name) for this to work
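
For readers unsure where the namespace comes in, here is a short sketch of the two ways to reach the Regex class. The class and method names (UsingDemo, WithUsing, FullyQualified) are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions; // brings Regex into scope

public static class UsingDemo
{
    // Relies on the using directive at the top of the file.
    public static string WithUsing(string html)
        => Regex.Replace(html, "<[^>]+>", "");

    // Works even without the using directive.
    public static string FullyQualified(string html)
        => System.Text.RegularExpressions.Regex.Replace(html, "<[^>]+>", "");

    public static void Main()
    {
        Console.WriteLine(WithUsing("<a>xx</a>"));      // xx
        Console.WriteLine(FullyQualified("<a>xx</a>")); // xx
    }
}
```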

  • For the regular TAG MATCHING - why don't you just try



    (<[^>]+>)





    It will match all tags (also the nested ones) together with any attributes the tags contain!!!



    cu
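
To see what the pattern above actually picks up, the matches can be listed one by one. This is a sketch only; TagLister and FindTags are made-up names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagLister
{
    // Hypothetical helper: collects every substring matched by (<[^>]+>).
    public static List<string> FindTags(string html)
    {
        var tags = new List<string>();
        foreach (Match m in Regex.Matches(html, "(<[^>]+>)"))
        {
            tags.Add(m.Groups[1].Value);
        }
        return tags;
    }

    public static void Main()
    {
        var tags = FindTags("<p class=\"x\">a <b>nested</b> tag</p>");
        Console.WriteLine(string.Join(" | ", tags));
        // <p class="x"> | <b> | </b> | </p>
    }
}
```

Each opening and closing tag is reported as its own match, attributes included, which is why the same pattern works for stripping when passed to Regex.Replace.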

  • What about "&nbsp;" ??
