Paulo Morgado

.NET Development & Architecture

Disclaimer

The opinions and viewpoints expressed in this site are mine and do not necessarily reflect those of Microsoft, my employer or any community that I belong to. Any code or opinions are offered as is. Products or services mentioned are purchased by me, made available to me by my employer or the manufacturer/vendor which doesn't influence my opinion in any way.

Cleaning HTML With Regular Expressions

While participating in a forum discussion, the need to clean up HTML from "dangerous" constructs came up.

In this case, the SCRIPT, OBJECT, APPLET, EMBED, FRAMESET, IFRAME, FORM, INPUT, BUTTON and TEXTAREA elements (as far as I can think of) had to be removed from the HTML source. Every event attribute (ONEVENT) should also be removed, while all other attributes are kept.

HTML is very loose and extremely hard to parse. Elements can be defined as a start tag (<element-name>) and an end tag (</element-name>) although some elements don't require the end tag. If XHTML is being parsed, elements without an end tag require the tag to be terminated with /> instead of just >.

Attributes are no easier to parse. In XHTML, attribute values must be delimited by single quotes (') or double quotes ("), but HTML and browsers also accept some attribute values without any delimiter.

We could build a parser, but then it will become costly to add or remove elements or attributes. Using a regular expression to remove unwanted elements and attributes seems like the best option.

First, let's capture all unwanted elements with start and end tags. To capture these elements we must:

  • Capture the begin tag character followed by the element name (which we will store under the name t): <(?<t>element-name)
  • Capture optional white space followed by any characters: (\s+.*?)?
  • Capture the end tag character: >
  • Capture any characters, if present: .*?
  • Capture the begin tag character followed by the closing tag character, the element name (referenced by the name t) and the end tag character: </\k<t>>

<(?<t>element-name)(\s+.*?)?>.*?</\k<t>>

To capture all unwanted element types, we end up with the following regular expression:

<(?<t>script|object|applet|embed|frameset|iframe|form|textarea)(\s+.*?)?>.*?</\k<t>>
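
As a minimal sketch of how this expression behaves, the snippet below runs it against a made-up input string (the sample HTML and the variable names are assumptions for illustration only, and the element name is written as embed). It prints the name stored in the group t, the text of the match, and the input with the match removed.

```csharp
using System;
using System.Text.RegularExpressions;

// Paired-element pattern from the text (element name written as "embed").
var pairedElements = new Regex(
    @"<(?<t>script|object|applet|embed|frameset|iframe|form|textarea)(\s+.*?)?>.*?</\k<t>>");

// Hypothetical sample input.
string pairedSample =
    "<p>Hi</p><script type=\"text/javascript\">alert(1);</script><p>Bye</p>";

Match pairedMatch = pairedElements.Match(pairedSample);
Console.WriteLine(pairedMatch.Groups["t"].Value); // element name stored in group t
Console.WriteLine(pairedMatch.Value);             // the whole element, content included

// Replacing every match with an empty string removes the element.
string pairedCleaned = pairedElements.Replace(pairedSample, string.Empty);
Console.WriteLine(pairedCleaned);
```

With this input the script element is the only match, so the last line printed should be <p>Hi</p><p>Bye</p>. No RegexOptions are set here, so matching is case-sensitive and . does not match line breaks.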

Next, let's capture all unwanted elements without an end tag. To capture these elements we must:

  • Capture the begin tag character followed by the element name: <element-name
  • Capture optional white space followed by any characters: (\s+.*?)?
  • Capture an optional closing tag character: /?
  • Capture the end tag character: >

<element-name(\s+.*?)?/?>

To capture all unwanted element types, we end up with the following regular expression:

<(script|object|applet|embed|frameset|iframe|form|textarea|input|button)(\s+.*?)?/?>
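
The following sketch applies this second expression to two made-up inputs (sample strings and variable names are assumptions, and the element name is written as embed). It is only meant to show what a single match covers: the start tag itself, with or without the trailing slash.

```csharp
using System;
using System.Text.RegularExpressions;

// Start-tag pattern from the text (element name written as "embed").
var startTags = new Regex(
    @"<(script|object|applet|embed|frameset|iframe|form|textarea|input|button)(\s+.*?)?/?>");

// Hypothetical input with an XHTML-style and an HTML-style input element.
string inputsSample =
    "<p>Name: <input type=\"text\" name=\"n\" /></p><input type=\"hidden\" value=\"1\">";
string inputsCleaned = startTags.Replace(inputsSample, string.Empty);
Console.WriteLine(inputsCleaned);

// Hypothetical input where the unwanted element has content and an end tag.
string buttonSample = "<button type=\"submit\">Go</button>";
string buttonCleaned = startTags.Replace(buttonSample, string.Empty);
Console.WriteLine(buttonCleaned);
```

The first line printed should be <p>Name: </p>. The second should be Go</button>, because this expression only matches the start tag and leaves the content and end tag of the element in place.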

To remove those unwanted elements from the source HTML, we can combine these two previous regular expressions into one and replace any match with an empty string:

Regex.Replace(
    sourceHtml,
    // Paired elements come first, so their content is removed too; then any remaining start tags.
    "(<(?<t>script|object|applet|embed|frameset|iframe|form|textarea)(\\s+.*?)?>.*?</\\k<t>>)"
        + "|(<(script|object|applet|embed|frameset|iframe|form|input|button|textarea)(\\s+.*?)?/?>)",
    string.Empty);

And finally, the unwanted attributes. This one is trickier because we want to capture unwanted attributes inside an element's start tag. To achieve that, we need to match an element's opening tag and capture all attribute definitions. To capture these attributes we must:

  • Match but ignore the begin tag character followed by any element name: (?<=<\w+)
  • Match any number of:
    • Match, but don’t capture, mandatory white space: (?:\s+)
    • Capture attribute definition:
      • Capture mandatory attribute name: \w+
      • Capture mandatory equals sign: =
      • Capture value specification in one of the forms:
        • Capture double quoted value: "[^"]*"
        • Capture single quoted value: '[^']*'
        • Capture unquoted value: .*?
  • Match but ignore end tag: (?=/?>)

(?<=<\w+)((?:\s+)(\w+=(("[^"]*")|('[^']*')|(.*?))))*(?=/?>)

The problem with the previous regular expression is that it matches the whole list of attributes in the start tag and not each unwanted attribute by itself. This prevents us from replacing each match with a fixed value (empty string).
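
To make this concrete, here is a small sketch that runs the attribute expression (with balanced parentheses) over a made-up start tag. The sample HTML and variable names are assumptions. It relies on the fact that, without ExplicitCapture, the attribute definition is the second unnamed group, and that .NET keeps every capture of a repeated group in its Captures collection.

```csharp
using System;
using System.Text.RegularExpressions;

// Attribute-list pattern from the text; "" is an escaped quote in a verbatim string.
var attributeList = new Regex(
    @"(?<=<\w+)((?:\s+)(\w+=((""[^""]*"")|('[^']*')|(.*?))))*(?=/?>)");

// Hypothetical sample input.
string tagSample = "<a href=\"/home\" onclick=\"track()\" title='Home'>Home</a>";

Match listMatch = attributeList.Match(tagSample);

// One match covers the entire attribute list of the start tag.
Console.WriteLine("[" + listMatch.Value + "]");

// Each attribute definition is only available as a capture of group 2.
foreach (Capture attribute in listMatch.Groups[2].Captures)
{
    Console.WriteLine(attribute.Value);
}
```

The single match spans all three attributes, while the individual definitions (href, onclick and title) show up as three captures of the same group. That is why a plain string replacement cannot drop just the onclick attribute.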

To solve this, we have to name what we want to capture and use the Replace overload that uses a MatchEvaluator.

We could capture unwanted attributes as we did for the unwanted elements, but then we would need to remove them from the list of all the element’s attributes. Instead, we’ll capture the wanted attributes and build the list of attributes. To identify the wanted attributes, we’ll need to name them (a). The resulting code will be something like this:

Regex.Replace(
    sourceHtml,
    "((?<=<\\w+)((?:\\s+)((?:on\\w+=((\"[^\"]*\")|('[^']*')|(.*?)))|(?<a>(?!on)\\w+=((\"[^\"]*\")|('[^']*')|(.*?)))))*(?=/?>))",
    match =>
    {
        if (!match.Groups["a"].Success)
        {
            return string.Empty;
        }
        
        var attributesBuilder = new StringBuilder();
        
        foreach(Capture capture in match.Groups["a"].Captures)
        {
            attributesBuilder.Append(' ');
            attributesBuilder.Append(capture.Value);
        }
        
        return attributesBuilder.ToString();
    }
);
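
As a usage sketch, the same expression can be wrapped in a small helper and tried on a few inputs. RemoveEventAttributes is a hypothetical name, the sample strings are made up, and the evaluator is a LINQ variant of the StringBuilder loop above rather than a different technique.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Attribute pattern from the text; wanted attributes are captured in group a.
var eventAttributes = new Regex(
    @"((?<=<\w+)((?:\s+)((?:on\w+=((""[^""]*"")|('[^']*')|(.*?)))|(?<a>(?!on)\w+=((""[^""]*"")|('[^']*')|(.*?)))))*(?=/?>))");

// Hypothetical helper: rebuilds the attribute list from the captures of group a only.
string RemoveEventAttributes(string html) =>
    eventAttributes.Replace(html, m => string.Concat(
        m.Groups["a"].Captures.Cast<Capture>().Select(c => " " + c.Value)));

// Quoted values, one event attribute in the middle.
Console.WriteLine(RemoveEventAttributes(
    "<a href=\"/home\" onclick=\"track()\" title='Home'>Home</a>"));

// Unquoted values.
Console.WriteLine(RemoveEventAttributes("<img src=logo.png onload=init()>"));

// Only an event attribute: the evaluator returns an empty string.
Console.WriteLine(RemoveEventAttributes("<div onclick=\"x()\">text</div>"));
```

For these three inputs the output should be <a href="/home" title='Home'>Home</a>, <img src=logo.png> and <div>text</div>. Since no RegexOptions are passed, an attribute written as ONCLICK would not be recognized in this sketch.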

To avoid parsing the source HTML more than once, we can combine all the regular expressions into a single one.

Because we are still outputting only the wanted attributes, there’s no change to the match evaluator.

A few options (RegexOptions) will also be added to increase functionality and performance:

  • IgnoreCase: For case-insensitive matching.
  • CultureInvariant: For ignoring cultural differences in language.
  • Singleline: So that . also matches line breaks and elements that span several lines are matched. (Multiline only changes the meaning of ^ and $, which these expressions don't use.)
  • ExplicitCapture: For capturing only named captures.
  • Compiled: For compiling the regular expression into an assembly. Only if the regular expression is to be used many times.

The resulting code will be this:

Regex.Replace(
    sourceHtml,
    "(<(?<t>script|object|applet|embed|frameset|iframe|form|textarea)(\\s+.*?)?>.*?</\\k<t>>)"
        + "|(<(script|object|applet|embed|frameset|iframe|form|input|button|textarea)(\\s+.*?)?/?>)"
        + "|((?<=<\\w+)((?:\\s+)((?:on\\w+=((\"[^\"]*\")|('[^']*')|(.*?)))|(?<a>(?!on)\\w+=((\"[^\"]*\")|('[^']*')|(.*?)))))*(?=/?>))",
    match =>
    {
        if (!match.Groups["a"].Success)
        {
            return string.Empty;
        }

        var attributesBuilder = new StringBuilder();

        foreach(Capture capture in match.Groups["a"].Captures)
        {
            attributesBuilder.Append(' ');
            attributesBuilder.Append(capture.Value);
        }

        return attributesBuilder.ToString();
    },
    RegexOptions.IgnoreCase
        | RegexOptions.Singleline
        | RegexOptions.ExplicitCapture
        | RegexOptions.CultureInvariant
        | RegexOptions.Compiled
);
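
To see the options at work, the sketch below builds the combined expression once and runs it over a made-up fragment with upper-case element names and a script element that spans several lines. CleanHtml, the variable names and the sample HTML are assumptions. The sketch uses RegexOptions.Singleline so that . matches line breaks, and it uses the LINQ form of the evaluator, which should behave like the StringBuilder version.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// The three alternatives from the text, combined; built once and reused.
var htmlCleaner = new Regex(
    @"(<(?<t>script|object|applet|embed|frameset|iframe|form|textarea)(\s+.*?)?>.*?</\k<t>>)"
    + @"|(<(script|object|applet|embed|frameset|iframe|form|input|button|textarea)(\s+.*?)?/?>)"
    + @"|((?<=<\w+)((?:\s+)((?:on\w+=((""[^""]*"")|('[^']*')|(.*?)))|(?<a>(?!on)\w+=((""[^""]*"")|('[^']*')|(.*?)))))*(?=/?>))",
    RegexOptions.IgnoreCase
        | RegexOptions.Singleline       // assumption: lets . match line breaks
        | RegexOptions.ExplicitCapture
        | RegexOptions.CultureInvariant
        | RegexOptions.Compiled);

// Hypothetical helper: element matches have no group a captures, so they become empty.
string CleanHtml(string html) =>
    htmlCleaner.Replace(html, m => string.Concat(
        m.Groups["a"].Captures.Cast<Capture>().Select(c => " " + c.Value)));

// Hypothetical multi-line input with mixed-case names.
string page =
    "<DIV onClick=\"steal()\" class=\"box\">\n" +
    "<SCRIPT>\nalert(1);\n</SCRIPT>\n" +
    "<a href='/home' onmouseover='x()'>Home</a>\n" +
    "</DIV>";

string cleanedPage = CleanHtml(page);
Console.WriteLine(cleanedPage);
```

For this input, the DIV should keep only its class attribute, the whole SCRIPT element should disappear even though it spans three lines, and the link should keep only href. Keeping the Regex instance in a variable (or a static field) is what makes the Compiled option worthwhile.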

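This was not extensively tested, so some wanted HTML might be removed and some unwanted HTML kept, but it’s probably very close to a good solution. For example, the attribute expression only recognizes name=value pairs whose names consist of word characters, so event attributes that follow a valueless attribute (selected) or a hyphenated attribute name (data-id) in a start tag can be kept.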

Posted: Sep 05 2011, 05:42 AM by Paulo Morgado | with 5 comment(s)

Comments

LuxIt said:

Very interesting article! Thanks :)

# September 5, 2011 2:05 PM

RichardD said:

As one StackOverflow user wrote:

"Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp. Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide. The <center> cannot hold it is too late. The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty. If you parse HTML with regex you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes."

stackoverflow.com/.../1732454

www.codinghorror.com/.../parsing-html-the-cthulhu-way.html

# September 6, 2011 3:38 PM

Paulo Morgado said:

Parsing might have been the wrong term but the goal was not to render the HTML but to remove "dangerous" items. (Well, the real goal was to play with regular expressions :) )

Do you have any example of something that escapes this cleansing?

# September 6, 2011 3:58 PM