Redefining Boundaries - \b and \B

Monday, September 22, 2003

Regex

\b and \B are useful metacharacters; they provide a simple syntax for finding items that sit at word boundaries (or, in the case of \B, at non-word boundaries). For example, to find the word "per" you can wrap the text in \b's to ensure that it is not inadvertently matched inside "person" or "Supercalif", i.e.:

\b(per)\b // parentheses included for clarity
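As a quick check of the pattern above, here is a minimal sketch. The class and method names (WordBoundaryDemo, ContainsWordPer) and the sample sentences are made up for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordBoundaryDemo
{
    // Returns true when "per" occurs in the input as a standalone word.
    public static bool ContainsWordPer(string input)
    {
        return Regex.IsMatch(input, @"\b(per)\b");
    }

    public static void Main()
    {
        Console.WriteLine(ContainsWordPer("ten miles per hour")); // True
        Console.WriteLine(ContainsWordPer("a person"));           // False
        Console.WriteLine(ContainsWordPer("Supercalif"));         // False
    }
}
```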

Now, \b is not magic: it doesn't actually know what is and isn't a word; it simply wraps a more complicated piece of functionality. The actual logic for \b goes something like this...

Find a "position" at which:

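- the character immediately to the left is not a 'word' character (or there is none) and the character immediately to the right is a 'word' character, or
- the character immediately to the left is a 'word' character and the character immediately to the right is not a 'word' character (or there is none)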

Using the lookaround features of the .NET regular expression engine, you can emulate this behaviour like so:

(?<!\w)(?=\w)Foo(?<=\w)(?!\w)
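To see that the longhand form behaves like \bFoo\b, the two can be run side by side. This is only a sketch: the class name, the constants, and the sample text are invented for the comparison. The longhand pattern can assert "start of word" on the left and "end of word" on the right only because Foo itself begins and ends with word characters; a general stand-in for a single \b would be the alternation (?:(?<!\w)(?=\w)|(?<=\w)(?!\w)).

```csharp
using System;
using System.Text.RegularExpressions;

public static class LonghandBoundaryDemo
{
    public const string Shorthand = @"\bFoo\b";
    public const string Longhand = @"(?<!\w)(?=\w)Foo(?<=\w)(?!\w)";

    // Counts how many times the pattern matches in the input.
    public static int Count(string pattern, string input)
    {
        return Regex.Matches(input, pattern).Count;
    }

    public static void Main()
    {
        // Standalone "Foo" appears three times: at the start, in "(Foo)" and in "Foo."
        string input = "Foo FooBar (Foo) BarFoo Foo_1 Foo.";
        Console.WriteLine(Count(Shorthand, input)); // 3
        Console.WriteLine(Count(Longhand, input));  // 3
    }
}
```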

This probably seems like excessive noise, because you will rarely want to write \b in its longhand form. However, there are times when it is useful to do so.

While recently building a tool that marks up text, I had to create an algorithm that could dynamically generate regexes to find keywords from any programming language. At first this seemed like a fairly simple task: join all of the keywords with the regex alternation metacharacter, "|", and wrap the result in word boundary metacharacters. For example, matching T-SQL keywords would produce a pattern like this:

\b(DECLARE|SET|OR|BEGIN|END.....)\b
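A rough sketch of that first, naive approach might look like the following. BuildPattern and the sample SQL string are hypothetical, and the sketch assumes the keywords contain no regex metacharacters, so nothing is escaped:

```csharp
using System;
using System.Text.RegularExpressions;

public static class KeywordPatternDemo
{
    // Joins the keywords with "|" and wraps the group in \b ... \b.
    // Assumption: keywords are plain words with no regex metacharacters.
    public static string BuildPattern(string[] keywords)
    {
        return @"\b(" + String.Join("|", keywords) + @")\b";
    }

    public static void Main()
    {
        string[] keywords = { "DECLARE", "SET", "OR", "BEGIN", "END" };
        string pattern = BuildPattern(keywords);
        Console.WriteLine(pattern); // \b(DECLARE|SET|OR|BEGIN|END)\b

        // Prints BEGIN, DECLARE, SET and END, one per line.
        string sql = "BEGIN DECLARE @order INT SET @order = 1 END";
        foreach (Match m in Regex.Matches(sql, pattern))
            Console.WriteLine(m.Value);
    }
}
```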

The problem is that, because of the nature of \b, the "@" character is not counted as a word character, so \b finds no boundary between whitespace and "@". That would have meant a different, hard-coded set of rules for things like "@@SPID" and "@@FETCH_STATUS". To get around this, I not only generate the keyword list dynamically but also, based on whether any keywords contain non-\w characters, generate the representation of the word boundary marker. This C# snippet demonstrates the idea:

using System ;
using System.Text.RegularExpressions ;

namespace RegexSnippets.Tests
{
    public class Foo
    {
        public static void Main()
        {
            string source = @"(DECLARE|SET|OR|BEGIN|END|@@FETCH_STATUS|@@SPID)" ;
            string marker = GetBoundaryChars( source ) ;
            source = String.Format(marker, source) ;

            Console.WriteLine( source ) ;
            Console.ReadLine() ;
        }

        private static string GetBoundaryChars( string words )
        {
            string pattern = @"[^\|\w\(\)]" ;

            Regex re = new Regex(
                    pattern,
                    RegexOptions.IgnoreCase|RegexOptions.Multiline
                ) ;
            string marker = "" ;
            if( re.Match( words ).Success )
            {
                for( Match m = re.Match( words ); m.Success; m = m.NextMatch() )
                {
                    if( marker.IndexOf( m.Value ) == -1 )
                        marker += m.Value ;
                }
                marker = String.Format(@"(?<![{0}\w])(?=[{0}\w]){1}(?<=[{0}\w])(?![{0}\w])", marker, "{0}") ;
            }
            else
            {
                // Verbatim string: in a regular C# string literal, "\b" is a backspace character.
                marker = @"\b{0}\b" ;
            }

            return marker ;
        }
    }
}
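To illustrate why the widened boundary matters, the sketch below runs a plain \b pattern and an "@"-aware pattern over a made-up line of T-SQL. The class name, field names, and the sample statement are assumptions for demonstration only; the widened pattern is what the listing above should print for this keyword list.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CustomBoundaryDemo
{
    const string Words = "(DECLARE|SET|OR|BEGIN|END|@@FETCH_STATUS|@@SPID)";

    // Plain word boundaries around the alternation.
    public static readonly string Plain = @"\b" + Words + @"\b";

    // Boundaries widened so that "@" counts as part of a word.
    public static readonly string Widened =
        @"(?<![@\w])(?=[@\w])" + Words + @"(?<=[@\w])(?![@\w])";

    // Counts how many times the pattern matches in the input.
    public static int Count(string pattern, string input)
    {
        return Regex.Matches(input, pattern).Count;
    }

    public static void Main()
    {
        string sql = "SET @id = @@SPID";
        Console.WriteLine(Count(Plain, sql));   // 1 (only SET; no \b exists between " " and "@")
        Console.WriteLine(Count(Widened, sql)); // 2 (SET and @@SPID)
    }
}
```

One caveat with this approach: the extra characters are dropped into a character class as-is, so a keyword containing something like "]" or "^" would presumably need escaping before being placed inside the brackets.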
