Bad Word Filter With Regular Expressions

I have seen many versions of these, and much of the time they assume that a bad word will be written out in full, i.e. BADWORD.  They overlook the fact that people who learn of this rule can simply bypass it by adding symbols in between, i.e. B*A*D*W*O*R*D.  Of course, this would not be recognized by simply searching the string for BADWORD.

The technique I have used here relies on a base list in XML.  I have created a class called BadWordFilter, which uses the singleton pattern.  I do this because the class first has to compile a list of Regex objects from the words inside the base XML file, and as I do not want these recompiled at every bad word check, I have opted for the singleton pattern.

For any word in the list, the rendered pattern follows a set template.  So if we look again at BADWORD, the regular expression I have come up with would be as follows.

([b|B][\W]*[a|A][\W]*[d|D][\W]*[w|W][\W]*[o|O][\W]*[r|R][\W]*[d|D][\W]*)

What I do is create the pattern at runtime.  For each letter I accept either the lower or upper case form, followed by any run of non-word characters (the \W class), so the pattern ultimately matches anything which spells our bad word once those non-word characters are ignored.
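
As a rough sketch of that runtime step (not the class listed further down), the helper here builds the pattern for one word.  The names PatternSketch and BuildPattern are made up for illustration.  It differs from the pattern shown above in one respect: it writes each class as [bB] rather than [b|B], because inside a character class the | is a literal pipe rather than an "or", so [b|B] also accepts the | character itself.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class PatternSketch
{
    // Builds one pattern for a word: every letter in either case,
    // each followed by any run of non-word characters.
    public static string BuildPattern(string word)
    {
        StringBuilder sb = new StringBuilder("(");
        foreach (char c in word)
        {
            if (char.IsLetter(c))
            {
                // No '|' between the two cases: it would be matched literally.
                sb.Append('[').Append(char.ToLowerInvariant(c)).Append(char.ToUpperInvariant(c)).Append(']');
            }
            else
            {
                // Non-letters (e.g. a space in "bad word") are matched literally.
                sb.Append(Regex.Escape(c.ToString()));
            }
            sb.Append("\\W*");
        }
        sb.Append(")");
        return sb.ToString();
    }
}
```

Under these assumptions, BuildPattern("badword") matches both BADWORD and B*A*D*W*O*R*D.  Note that \W does not cover digits or the underscore, so B_A_D_W_O_R_D would still slip through.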

 

I have created a simple test page here to have a go.  Please note that, for the purposes of this demonstration, I have only put the really serious words in the list.  I have not published this list as I do not think it is necessary.  I have used a simple XML structure, so please feel free to copy the code here and add as many bad words as you like <s>.

Example Page : http://andrewrea.co.uk/badwordfilter/Default.aspx

 

The BadWordFilter class

using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;
using System.Web;
using System.Xml;

/// <summary>
/// Filters bad words out of text using one regular expression per word, built from an XML word list
/// </summary>
public class BadWordFilter
{

    /// <summary>
    /// These are the options which I use in order to determine the way I handle any bad text
    /// </summary>
    public enum CleanUpOptions
    {
        ReplaceEachWord,
        BlankBadText,
        ReplaceWholeText
    }

    /// <summary>
    /// Private constructor and instantiate the list of regex
    /// </summary>
    private BadWordFilter()
    {
        patterns = new List<Regex>();
    }

    /// <summary>
    /// The patterns
    /// </summary>
    private List<Regex> patterns;

    
    public List<Regex> Patterns
    {
        get { return patterns; }
        set { patterns = value; }
    }

    private static BadWordFilter m_instance = null;

    public static BadWordFilter Instance
    {
        get
        {
            if (m_instance == null)
                m_instance = CreateBadWordFilter(HttpContext.Current.Server.MapPath("listofwords.xml"));

            return m_instance;
        }
    }

    /// <summary>
    /// Create all the patterns required and add them to the list
    /// </summary>
    /// <param name="badWordFile"></param>
    /// <returns></returns>
    protected static BadWordFilter CreateBadWordFilter(string badWordFile)
    {
        BadWordFilter filter = new BadWordFilter();
        XmlDocument badWordDoc = new XmlDocument();
        badWordDoc.Load(badWordFile);

        //Fetch the node list once rather than querying the document again on every iteration
        XmlNodeList wordNodes = badWordDoc.GetElementsByTagName("word");

        //Loop through the xml document for each bad word in the list
        for (int i = 0; i < wordNodes.Count; i++)
        {
            //Split each word into a character array
            char[] characters = wordNodes[i].InnerText.ToCharArray();
            
            //We need a fast way of appending to an existing string
            StringBuilder patternBuilder = new StringBuilder();

            //The start of the pattern
            patternBuilder.Append("(");

            //We next go through each letter and append the part of the pattern.
            //It is this stage which generates the upper and lower case variations
            for (int j = 0; j < characters.Length; j++)
            {
                patternBuilder.AppendFormat("[{0}|{1}][\\W]*", characters[j].ToString().ToLower(), characters[j].ToString().ToUpper());
            }

            //End the pattern
            patternBuilder.Append(")");

            //Add the new pattern to our list.
            filter.Patterns.Add(new Regex(patternBuilder.ToString()));
        }
        return filter;
    }

    /// <summary>
    /// The function which returns the manipulated string
    /// </summary>
    /// <param name="input"></param>
    /// <param name="options"></param>
    /// <returns></returns>
    public string GetCleanString(string input, CleanUpOptions options)
    {
        if (options == CleanUpOptions.BlankBadText)
        {
            for (int i = 0; i < patterns.Count; i++)
            {
                //In this instance we want to return an empty string if we find any bad word
                if (patterns[i].Match(input).Success)
                    return String.Empty;
            }
        }
        else if (options == CleanUpOptions.ReplaceWholeText)
        {
            for (int i = 0; i < patterns.Count; i++)
            {
                //In this instance we want to return a specified statement if we find any bad word
                if (patterns[i].Match(input).Success)
                    return "The text contains unsuitable content";
            }
        }
        else
        {
            for (int i = 0; i < patterns.Count; i++)
            {
                //In this instance we actually replace each instance of any bad word with a specified string.
                input = patterns[i].Replace(input, "**Unsuitable Word**");
            }
        }

        //return the manipulated string
        return input;
    }
}
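
One thing to be aware of: the Instance getter above is not synchronized, so two requests arriving at the same time on a cold start could each build the pattern list.  A possible alternative, assuming .NET 4 or later is available, is Lazy<T>.  The sketch below uses a hypothetical stand-in class, LazyFilter, with a hard-coded pattern in place of the XML file so that it runs on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Threading;

// Hypothetical stand-in for BadWordFilter, reduced to the singleton plumbing.
public sealed class LazyFilter
{
    // Counts how often the expensive build step ran (for demonstration only).
    public static int BuildCount;

    // Lazy<T> runs the factory at most once, even if several threads
    // read Instance at the same moment.
    private static readonly Lazy<LazyFilter> lazy = new Lazy<LazyFilter>(Build);

    public static LazyFilter Instance
    {
        get { return lazy.Value; }
    }

    private readonly List<Regex> patterns = new List<Regex>();

    // Read-only view, so callers cannot swap the compiled list.
    public IList<Regex> Patterns
    {
        get { return patterns.AsReadOnly(); }
    }

    private LazyFilter() { }

    private static LazyFilter Build()
    {
        Interlocked.Increment(ref BuildCount);
        LazyFilter filter = new LazyFilter();
        // Assumption: one hard-coded pattern stands in for the XML word list.
        filter.patterns.Add(new Regex(@"[bB]\W*[aA]\W*[dD]\W*", RegexOptions.Compiled));
        return filter;
    }
}
```

Every read of LazyFilter.Instance returns the same object, and the build step runs once.  In the real class the factory would call CreateBadWordFilter with the mapped path instead.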

The XML file which I have used is below.  Dead simple, but does the job.

<?xml version="1.0" encoding="utf-8" ?>
<words>
  <word>bad word</word>
  <word>ugly word</word>
  <word>bla bla bla</word>
</words>

Cheers,

 

Andrew :-)

Published Saturday, May 3, 2008 9:14 AM by REA_ANDREW

Comments

# re: Bad Word Filter With Regular Expressions

Saturday, May 3, 2008 3:45 PM by SarcasticBaldGuy

Doesn't catch sh!t $hit or $h!t

# re: Bad Word Filter With Regular Expressions

Sunday, May 4, 2008 7:04 PM by REA_ANDREW

SarcasticBaldGuy

Obviously it would be very difficult to interpret symbols as letters, as you have pointed out, using a dollar sign for S and an exclamation mark for an i.

Do you know of any current techniques which are used to counteract that?  I would say that the bad word file should be used to a great degree, with obvious filters in place, and maybe a common mapping class for symbol to letter, i.e.

$ => S|s

! => \||

In fairness though, it is a judgement call, as what they have typed in that case would not be an offensive word, yet it could be interpreted as offensive.  I would opt for the big difference there.  I could write "I sh!t the door," so in this case I have attempted to spell shut and not the bad word expected from your example.

My example is early doors and in need of refinement big time, but I feel the theory is sound and provides a good building block for others to build on.

Thanks for the post, I appreciate the feedback!!

Andrew :-)
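
The symbol-to-letter mapping suggested in this reply could be sketched roughly as follows.  The Lookalikes table and the LookalikePatterns class are hypothetical, and the three mappings are only examples; a real list would need tuning, since every added symbol also raises the chance of false positives such as the "sh!t the door" case.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class LookalikePatterns
{
    // Assumed mapping from a letter to symbols commonly typed in its place.
    private static readonly Dictionary<char, string> Lookalikes = new Dictionary<char, string>
    {
        { 's', "$" },
        { 'i', "!" },
        { 'a', "@" }
    };

    public static Regex Build(string word)
    {
        StringBuilder sb = new StringBuilder();
        foreach (char c in word.ToLowerInvariant())
        {
            if (char.IsLetter(c))
            {
                string extra;
                // extra stays null when no symbol is mapped; Append(null) adds nothing.
                Lookalikes.TryGetValue(c, out extra);
                // '$', '!' and '@' are all literal inside a character class.
                sb.Append('[').Append(c).Append(char.ToUpperInvariant(c)).Append(extra).Append(']');
            }
            else
            {
                sb.Append(Regex.Escape(c.ToString()));
            }
            sb.Append("\\W*");
        }
        return new Regex(sb.ToString());
    }
}
```

With the word "spam" as a harmless stand-in, LookalikePatterns.Build("spam") matches "$p@m" and "S.P.A.M" but not "spim".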

# re: Bad Word Filter With Regular Expressions

Thursday, September 4, 2008 9:38 AM by jaseen

what if user writes BADDWORD you should place a + after the letters bracket...

([b|B]+[\W]*[a|A]+[\W]*[d|D]+[\W]*[w|W]+[\W]*[o|O]+[\W]*[r|R]+[\W]*[d|D]+[\W]*)

but again this sucks

when user writes BAD-DWORD... or fuc-cker...

i think it's really hard to clear them... ppl will find a way to surpass it....

# re: Bad Word Filter With Regular Expressions

Friday, October 3, 2008 3:29 PM by Brian Boatright

Could you please post the code for the demo page, minus the XML file which is understandable. I converted your C# to VB but I'm having trouble using it.

# re: Bad Word Filter With Regular Expressions

Saturday, October 4, 2008 3:21 AM by REA_ANDREW

Hi Brian, sure.  I have zipped the files for you; here is the link.

andrewrea.co.uk/.../BadWordFilter.rar

Cheers,

Andrew

# re: Bad Word Filter With Regular Expressions

Sunday, October 12, 2008 5:23 PM by REA_ANDREW

Hi Enzo

The problem with the current build is that it searches for the word as part or as a whole in any word.  I think what you require is an enhancement to this whereby you can differentiate the types of searches made, i.e. part-word searches or whole-word searches.  So in your case the word "for" would come under the latter, so in effect the XML file would have to change to incorporate different methods of search, or rather different conditions.

I am thinking of starting this as a CodePlex project so that I can continually update it.  I will do this either tomorrow or the next day.  If you would like access to make contributions to the small project, let me know and I will give you write access when it comes online at CodePlex.

Cheers,

Andrew

# re: Bad Word Filter With Regular Expressions

Monday, June 15, 2009 4:02 PM by David

Any updates to BadWordFilter.cs?  I looked for it on CodePlex and could not find it.

Thanks,

Dave

# re: Bad Word Filter With Regular Expressions

Monday, June 29, 2009 4:58 PM by Gilberto

To overcome the issue of filtered words being replaced inside longer words (for example, brass will get cut if a** is in your XML), I added an attribute to the XML:

<word middleWord="true">filtered word</word>

Then in the code, just check for the attribute being true.

If it's true, my expression will have \b added to the beginning and the asterisk removed from the last character class in the string.  So, using the BADWORD example expression from above, the new expression for finding only that word would be:

\b([b|B][\W]*[a|A][\W]*[d|D][\W]*[w|W][\W]*[o|O][\W]*[r|R][\W]*[d|D][\W])

So if I typed abadword it would not find it.  Basically, words can still be snuck through, but if you are careful this should give you some flexibility over which words you allow inside other words and which ones you don't.
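
A variant of this whole-word idea is sketched below.  It is not Gilberto's exact expression: instead of requiring one trailing non-word character, it puts \b at both ends, so a word at the very end of the input is still caught.  WholeWordPatterns and its wholeWord flag are made-up names, and the sketch assumes the word contains letters only.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class WholeWordPatterns
{
    // Assumes 'word' consists of letters only.
    public static Regex Build(string word, bool wholeWord)
    {
        StringBuilder sb = new StringBuilder();
        if (wholeWord) sb.Append("\\b"); // must start at a word boundary
        for (int i = 0; i < word.Length; i++)
        {
            char c = word[i];
            sb.Append('[').Append(char.ToLowerInvariant(c)).Append(char.ToUpperInvariant(c)).Append(']');
            // Separators are allowed between letters, but not after the last one.
            if (i < word.Length - 1) sb.Append("\\W*");
        }
        if (wholeWord) sb.Append("\\b"); // must end at a word boundary
        return new Regex(sb.ToString());
    }
}
```

With wholeWord set to true, the pattern for "bad" matches "bad" and "b-a-d!" but neither "abad" nor "badge"; with it set to false, "abad" is matched as before.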

# re: Bad Word Filter With Regular Expressions

Monday, October 5, 2009 1:41 PM by Bob

Terrific application of dynamic creation of patterns and filtering offensive terms.

Only noticeable drawback: all of the characters of the original word must be present.  One workaround could be to make vowels optional ($h!t, cr@p), as well as making the $ interchangeable with "S" by treating it like a consonant, or adding commonly used badword hacks to the definition list.

Something like \b[a-zA-Z$$][\W]* ... [a-zA-Z$$]\b , keeping the character case insensitive, etc.  Or, perish the thought, asking folks to be courteous and not use annoying text decoration$ in their on-line p!r@o#s$e.  Honestly, how many words are spelled with symbols smack-dab in the middle? OK, smack-dab doesn't count.  Gotta' start another list :o)

Thanks for posting your solution - I plan on experimenting with this to provide some level of protection on my site's comments page.

# re: Bad Word Filter With Regular Expressions

Thursday, February 18, 2010 6:41 AM by NoLimitList

This seems to work fine for just changing the bad word in the textbox, but I can't seem to get it working with a custom validator for a Form View. I want to trigger an error to prevent the data from being inserted so that the author can rewrite it. As of now it just inserts things with "unwanted word" instead of what was originally there. I tried the following to trigger an error:

If text.Text = BadWordFilter.Instance.GetCleanString(text.Text, BadWordFilter.CleanUpOptions.ReplaceEachWord) Then
    args.IsValid = False
End If

Shouldn't this trigger an error preventing insertion of the data while also replacing the bad words?

# re: Bad Word Filter With Regular Expressions

Thursday, February 18, 2010 7:15 AM by NoLimitList

Here is an idea that would be fun, but not necessary. Can this be set up with each bad word in the XML file having a specific word to replace it.

I am trying to stop people from bad mouthing my company on my own website without requiring human oversight. For example, if someone says, "this is a bad company that should be sued", with the bad words being "bad" and "sued", could "bad" be replaced with "great" and "sued" be replaced with "awarded for excellence", so that what ends up on the site's blog/forum reads "this is a great company that should be awarded for excellence"?
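
Per-word replacements are possible in principle, because each pattern is applied on its own.  A minimal sketch, assuming the replacement text is stored next to each pattern (the Rules list and the ReplacementSketch class are hypothetical, and the XML format would need an extra attribute or element to carry the replacement):

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class ReplacementSketch
{
    // Each pattern is paired with its own replacement text.
    // \b at both ends limits the match to whole words.
    private static readonly List<KeyValuePair<Regex, string>> Rules = new List<KeyValuePair<Regex, string>>
    {
        new KeyValuePair<Regex, string>(new Regex(@"\b[bB]\W*[aA]\W*[dD]\b"), "great"),
        new KeyValuePair<Regex, string>(new Regex(@"\b[sS]\W*[uU]\W*[eE]\W*[dD]\b"), "awarded for excellence")
    };

    public static string Clean(string input)
    {
        // Apply every rule in turn, feeding each result into the next rule.
        foreach (KeyValuePair<Regex, string> rule in Rules)
        {
            input = rule.Key.Replace(input, rule.Value);
        }
        return input;
    }
}
```

Here ReplacementSketch.Clean("this is a bad company that should be sued") returns "this is a great company that should be awarded for excellence".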

# re: Bad Word Filter With Regular Expressions

Thursday, February 18, 2010 5:27 PM by NoLimitList

I fixed my problem and need to stop coding at 4 AM with little sleep. To create an error message without inserting bad words into a database just add the following:

VB.Net

Dim mytext As TextBox = CType(myFormView.FindControl("myTextBox"), TextBox)
Dim unsuitable As String = "The text contains unsuitable content"

' ReplaceWholeText returns the fixed message above when any bad word is found
Dim mytextcontent As String = BadWordFilter.Instance.GetCleanString(mytext.Text, BadWordFilter.CleanUpOptions.ReplaceWholeText)

' The input is valid only if the filter did not substitute the message
args.IsValid = (mytextcontent <> unsuitable)