Cleaning HTML With Regular Expressions

Monday, September 5, 2011

.NET HTML Regex

While participating in a forum discussion, the need to clean up HTML from "dangerous" constructs came up.

In the present case it was needed to remove SCRIPT, OBJECT, APPLET, EMBBED, FRAMESET, IFRAME, FORM, INPUT, BUTTON and TEXTAREA elements (as far as I can think of) from the HTML source. Every event attribute (ONEVENT) should also be removed keep all other attributes, though.

HTML is very loose and extremely hard to parse. Elements can be defined as a start tag (<element-name>) and an end tag (</element-name>) although some elements don't require the end tag. If XHTML is being parsed, elements without an end tag require the tag to be terminated with /> instead of just >.

Attributes are not easier to parse. By definition, attribute values are required to be delimited by quotes (') or double quotes ("), but some browsers accept attribute values without any delimiter.

We could build a parser, but then it will become costly to add or remove elements or attributes. Using a regular expression to remove unwanted elements and attributes seems like the best option.

First, lets capture all unwanted elements with start and end tags. To capture these elements we must:

Capture the begin tag character followed by the element name (for which we will store its name - t): <(?<t>element-name)
Capture optional white spaces followed by any character: (\s+.*?)?
Capture the end tag character: >
Capture optional any characters: .*?
Capture the begin tag character followed by closing tag character, the element name (referenced by the name - t) and the end tag character: </\k<t>>

<(?<t>tag-name(\s+.*?)?>.*?</\k<t>>

To capture all unwanted element types, we end up with the following regular expression:

<(?<t>script|object|applet|embbed|frameset|iframe|form|textarea)(\s+.*?)?>.*?</\k<t>>

Next, lets capture all unwanted elements without an end tag. To capture these elements we must:

Capture the begin tag character followed by the element name: <element-name
Capture optional white spaces followed by any character: (\s+.*?)?
Capture an optional closing tag character: /?
Capture the end tag character: >

<tag-name(\s+.*?)?/?>

To capture all unwanted element types, we end up with the following regular expression:

<(script|object|applet|embbed|frameset|iframe|form|textarea|input|button)(\s+.*?)?/?>

To remove those unwanted elements from the source HTML, we can combine these two previous regular expressions into one and replace any match with an empty string:

Regex.Replace(
    sourceHtml,
    "|(<(?<t>script|object|applet|embbed|frameset|iframe|form|textarea)(\\s+.*?)?>.*?</\\k<t>>)"
        + "|(<(script|object|applet|embbed|frameset|iframe|form|input|button|textarea)(\\s+.*?)?/?>)"    ,
string.Empty);

And finally, the unwanted attributes. This one is trickier because we want to capture unwanted attributes inside an element's start tag. To achieve that, we need to match an element's opening tag and capture all attribute definitions. To capture these attributes we must:

Match but ignore the begin tag character followed by any element name: (?<=<\w+)
Match all:
- Don’t capture mandatory with spaces: (?:\s+)
- Capture attribute definition:
  - Capture mandatory attribute name: \w+
  - Capture mandatory equals sign: =
  - Capture value specification in one of the forms:
    - Capture double quoted value: "[^"]*"
    - Capture single quoted value: '[^']*'
    - Capture unquoted value: .*?
Match but ignore end tag: (?=/?>)

(?<=<\w+)((?:\s+)(\w+=(("[^"]*")|('[^']*')|(.*?)))*(?=/?>)

The problem with the previous regular expression is that it matches the start tag and captures the whole list of attributes and not each unwanted attribute by itself. This prevents us from from replacing each match with a fixed value (empty string).

To solve this, we have to name what we want to capture and use the Replace overload that uses a MatchEvaluator.

We could capture unwanted attributes as we did for the unwanted elements, but then we would need to remove them from the list of all the element’s attributes. Instead, we’ll capture the wanted attributes and build the list of attributes. To identify the wanted attributes, we’ll need to name them (a). The resulting code will be something like this:

Regex.Replace(
    sourceHtml,
    "((?<=<\\w+)((?:\\s+)((?:on\\w+=((\"[^\"]*\")|('[^']*')|(.*?)))|(?<a>(?!on)\\w+=((\"[^\"]*\")|('[^']*')|(.*?)))))*(?=/?>))",
    match =>
    {
        if (!match.Groups["a"].Success)
        {
            return string.Empty;
        }
        
        var attributesBuilder = new StringBuilder();
        
        foreach(Capture capture in match.Groups["a"].Captures)
        {
            attributesBuilder.Append(' ');
            attributesBuilder.Append(capture.Value);
        }
        
        return attributesBuilder.ToString();
    }
);

To avoid parsing the source HTML more than once, we can combine all the regular expressions into a single one.

Because we are still outputting only the wanted attributes, there’s no change to the match evaluator.

A few options (RegexOptions) will also be added to increase functionality and performance:

IgnoreCase: For case-insensitive matching.
CultureInvariant: For ignoring cultural differences in language.
Multiline: For multiline mode.
ExplicitCapture: For capturing only named captures.
Compiled: For compiling the regular expression into an assembly. Only if the regular expression is to be used many times.

The resulting code will be this:

Regex.Replace(
    sourceHtml,
    "(<(?<t>script|object|applet|embbed|frameset|iframe|form|textarea)(\\s+.*?)?>.*?</\\k<t>>)"
        + "|(<(script|object|applet|embbed|frameset|iframe|form|input|button|textarea)(\\s+.*?)?/?>)"
        + "|((?<=<\\w+)((?:\\s+)((?:on\\w+=((\"[^\"]*\")|('[^']*')|(.*?)))|(?<a>(?!on)\\w+=((\"[^\"]*\")|('[^']*')|(.*?)))))*(?=/?>))",
    match =>
    {
        if (!match.Groups["a"].Success)
        {
            return string.Empty;
        }
        
        var attributesBuilder = new StringBuilder();
        
        foreach(Capture capture in match.Groups["a"].Captures)
        {
            attributesBuilder.Append(' ');
            attributesBuilder.Append(capture.Value);
        }
        
        return attributesBuilder.ToString();
    },
    RegexOptions.IgnoreCase
        | RegexOptions.Multiline
        | RegexOptions.ExplicitCapture
        | RegexOptions.CultureInvariant
        | RegexOptions.Compiled
);

This was not extensively tested and there might be some wanted HTML remove and some unwanted HTML kept, but it’s probably very close to a good solution.

Parsing might have been the wrong term but the goal was not to render the HTML but to remove "dangerous" items. (Well, the real goal was to play with regular expressions :) )

Do you have any example of something that escapes this clensing?

Paulo Morgado - Tuesday, September 6, 2011 7:58:34 PM

Nice! Worked perfectly for my app and speeded up my application by 5 fold in this area!

Surely is appreciated by me.

Roger Douglass - Sunday, August 5, 2012 10:22:28 PM

2 Comments