Redefining Boundaries - \b and \B
\b and \B are useful metacharacters: they provide a simple syntax for finding items that sit on word boundaries (or, in the case of \B, on non-word boundaries). For example, to find the word "per" you can wrap the text in \b's to ensure that it is not inadvertently matched inside "person" or "Supercalif", i.e.:
\b(per)\b // parentheses included for clarity
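As a quick check of that claim, here is a minimal sketch that runs the pattern against the three strings mentioned above. The class and method names (WordBoundaryDemo, ContainsWordPer) are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordBoundaryDemo
{
    // True when "per" appears in the input as a standalone word.
    public static bool ContainsWordPer(string input)
    {
        return Regex.IsMatch(input, @"\b(per)\b");
    }

    public static void Main()
    {
        string[] samples = { "miles per hour", "person", "Supercalif" };
        foreach (string sample in samples)
        {
            // Only the first sample contains "per" between two word boundaries.
            Console.WriteLine("{0}: {1}", sample, ContainsWordPer(sample));
        }
    }
}
```

Only "miles per hour" reports True; in the other two strings "per" has a word character on at least one side, so one of the \b assertions fails.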
Now, \b is not magic. It doesn't actually know what is and isn't a word; it simply wraps a more complicated piece of functionality. The actual logic for \b goes something like this...
Find a "position" at which:
- there is no 'word' character immediately to the left and there is a 'word' character immediately to the right, or
- there is a 'word' character immediately to the left and there is no 'word' character immediately to the right
Using the lookaround features of the .NET regular expression engine, you can emulate this behaviour like so:
(?<!\w)(?=\w)Foo(?<=\w)(?!\w)
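To see that the longhand form behaves like the shorthand, the sketch below compares \bFoo\b with the lookaround pattern above on a few inputs. The class and method names are hypothetical. Note that the pattern only needs the first rule at the start and the second rule at the end because "Foo" itself begins and ends with word characters; a free-standing \b would need both rules joined with "|".

```csharp
using System;
using System.Text.RegularExpressions;

public static class BoundaryEmulationDemo
{
    // The shorthand and the lookaround-based longhand from the text.
    public const string Shorthand = @"\bFoo\b";
    public const string Longhand = @"(?<!\w)(?=\w)Foo(?<=\w)(?!\w)";

    public static bool MatchesShorthand(string input)
    {
        return Regex.IsMatch(input, Shorthand);
    }

    public static bool MatchesLonghand(string input)
    {
        return Regex.IsMatch(input, Longhand);
    }

    public static void Main()
    {
        string[] samples = { "Foo", "a Foo b", "Foobar", "xFoo" };
        foreach (string sample in samples)
        {
            // Both columns should agree for every sample.
            Console.WriteLine("{0}: shorthand={1}, longhand={2}",
                sample, MatchesShorthand(sample), MatchesLonghand(sample));
        }
    }
}
```

Both patterns accept "Foo" and "a Foo b" and reject "Foobar" and "xFoo".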
This probably seems like excessive noise, because you will rarely want to write \b in its longhand form. However, there are times when it is useful to do so.
While recently building a tool that marks up text, I had to create an algorithm that could dynamically generate regexes to find the keywords of any programming language. At first this seemed like a fairly simple task: join all of the keywords with the regex "or" metacharacter, "|", and wrap the result in word boundary metacharacters. For example, matching T-SQL keywords would produce a pattern like this:
\b(DECLARE|SET|OR|BEGIN|END.....)\b
The problem is that, because of the nature of \b, the "@" character is not counted as a word character, which means I would need a different, hard-coded set of rules for things like "@@SPID" and "@@FETCH_STATUS". To get around this, I not only generate the keyword list dynamically but, based on whether any keywords contain non-\w characters, also generate the representation of the word boundary marker dynamically. This piece of pseudo-code demonstrates the idea:
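The failure is easy to reproduce. The following sketch (hypothetical class and method names) wraps "@@SPID" in plain \b's and tries it on two inputs. Because "@" is not a word character, there is no boundary between a space and "@", so the keyword is missed where it should be found and found where it arguably should not be.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AtSignBoundaryDemo
{
    // Naive pattern: the keyword wrapped in ordinary word boundaries.
    public static bool MatchesWithWordBoundary(string sql)
    {
        return Regex.IsMatch(sql, @"\b(@@SPID)\b");
    }

    public static void Main()
    {
        // No boundary between " " and "@", so the keyword is missed.
        Console.WriteLine(MatchesWithWordBoundary("SELECT @@SPID"));

        // A boundary does exist between "x" and "@", so this one matches.
        Console.WriteLine(MatchesWithWordBoundary("x@@SPID"));
    }
}
```

The first call prints False and the second prints True, which is the opposite of what a keyword highlighter wants.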
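using System;
using System.Text.RegularExpressions;

namespace RegexSnippets.Tests
{
    public class Foo
    {
        public static void Main()
        {
            string source = @"(DECLARE|SET|OR|BEGIN|END|@@FETCH_STATUS|@@SPID)";

            // Build a boundary template suited to the keywords, then drop the
            // keyword alternation into its {0} placeholder.
            string marker = GetBoundaryChars(source);
            source = String.Format(marker, source);

            Console.WriteLine(source);
            Console.ReadLine();
        }

        private static string GetBoundaryChars(string words)
        {
            // Anything that is not a word character, "|", "(" or ")" has to be
            // treated as part of a "word" when testing for boundaries.
            string pattern = @"[^\|\w\(\)]";
            Regex re = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Multiline);

            string marker = "";

            if (re.Match(words).Success)
            {
                // Collect each distinct non-word character once.
                // Note: characters that are special inside [...] (such as "]"
                // or "-") are not escaped here.
                for (Match m = re.Match(words); m.Success; m = m.NextMatch())
                {
                    if (marker.IndexOf(m.Value) == -1)
                        marker += m.Value;
                }

                // Longhand \b with the extra characters added alongside \w.
                marker = String.Format(
                    @"(?<![{0}\w])(?=[{0}\w]){1}(?<=[{0}\w])(?![{0}\w])",
                    marker, "{0}");
            }
            else
            {
                // Verbatim string: without the @, "\b" would be a backspace.
                marker = @"\b{0}\b";
            }

            return marker;
        }
    }
}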
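As a usage sketch, the block below applies the pattern that the listing above is meant to produce for the sample keyword list. The pattern is written out by hand here, and the class name, method name and sample SQL are assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class KeywordScanDemo
{
    // Boundary marker with "@" treated as a word character, wrapped around
    // the sample keyword alternation.
    public const string Pattern =
        @"(?<![@\w])(?=[@\w])(DECLARE|SET|OR|BEGIN|END|@@FETCH_STATUS|@@SPID)(?<=[@\w])(?![@\w])";

    // Returns every keyword found in the SQL text, in order of appearance.
    public static string[] FindKeywords(string sql)
    {
        MatchCollection matches = Regex.Matches(sql, Pattern);
        string[] found = new string[matches.Count];
        for (int i = 0; i < matches.Count; i++)
        {
            found[i] = matches[i].Value;
        }
        return found;
    }

    public static void Main()
    {
        string sql = "DECLARE @spid INT; SET @spid = @@SPID";
        Console.WriteLine(string.Join(", ", FindKeywords(sql)));
        // prints: DECLARE, SET, @@SPID
    }
}
```

The local variable @spid is left alone, and ordinary keywords still respect their boundaries: "SETTINGS" and "OFFSET" produce no match for SET.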