Using Regex to return the first N words in a string

Jeff Perrin needed a function to return the first N words in a string (to create a small summary or snippet). He did it by parsing the string manually, an approach that is more error-prone and usually makes for less readable code. Fortunately, regular expressions handle this quite nicely. Here's a test that makes sure we get the first 4 words in a string, along with the function "FindFirstWords", which does this with a simple regular expression.

The expression matches the first 4 occurrences of one or more word characters (\w: letters, digits and underscore) followed by one or more whitespace characters. The match contains 4 captures, one for each "word" that was found. My first version iterated over those captures to build the result; the version below simply returns the text of the whole match.

It's not fully tested, as you can see. I only wrote one test to see that it works on this sort of sentence. More tests could and should be added to cover other cases. In fact, if this were real TDD, I would have started with a test of an empty string, then continued on to test getting only one word, then two, and so on.

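// Both methods are meant to live inside an NUnit test fixture class.
using System.Text.RegularExpressions;
using NUnit.Framework;

[Test]
public void TestRegexFindFirstNWords()
{
      const string INPUT =
            "this is word four five six seven eight nine ten eleven twelve thirteen!";
      const int NUM_WORDS_TO_RETURN = 4;

      string output = FindFirstWords(INPUT, NUM_WORDS_TO_RETURN);

      // Each repetition of the pattern includes the whitespace after the word,
      // so the result ends with a trailing space.
      string expectedOutput = "this is word four ";
      Assert.AreEqual(expectedOutput, output);
}

private string FindFirstWords(string input, int howManyToFind)
{
      // thanks to Jeff Atwood for making this code even simpler!
      // Matches exactly howManyToFind repetitions of "word characters, then whitespace".
      string REGEX = @"([\w]+\s+){" + howManyToFind + "}";
      return Regex.Match(input, REGEX).Value;
}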
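
The extra cases mentioned above (an empty string, one word, two words) might look something like the following sketch. It is not from the original post: the fixture and test names are made up for illustration, FindFirstWords is copied here (and made public and static) so the block stands on its own, and it assumes NUnit's classic Assert.AreEqual, as used in the test above.

```csharp
using System.Text.RegularExpressions;
using NUnit.Framework;

// Hypothetical fixture name; only the regex itself comes from the post.
[TestFixture]
public class FindFirstWordsEdgeCaseTests
{
    private const string Input = "this is word four five six";

    [Test]
    public void EmptyStringReturnsEmptyString()
    {
        // A failed match has an empty Value, so no special casing is needed.
        Assert.AreEqual("", FindFirstWords("", 4));
    }

    [Test]
    public void ReturnsOnlyOneWord()
    {
        // The trailing space is part of the match (\s+ follows each word).
        Assert.AreEqual("this ", FindFirstWords(Input, 1));
    }

    [Test]
    public void ReturnsTwoWords()
    {
        Assert.AreEqual("this is ", FindFirstWords(Input, 2));
    }

    // Same expression as in the post, repeated so this block is self-contained.
    public static string FindFirstWords(string input, int howManyToFind)
    {
        string pattern = @"([\w]+\s+){" + howManyToFind + "}";
        return Regex.Match(input, pattern).Value;
    }
}
```

Note that a sentence with fewer words than requested also comes back as an empty string, because the pattern only matches when it finds exactly N repetitions.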

Published Friday, January 07, 2005 4:47 AM by RoyOsherove

Comments

Thursday, January 06, 2005 10:40 PM by patag

# re: Using Regex to return the first N words in a string

Why not just look for the Nth single space?
Thursday, January 06, 2005 11:52 PM by Jeff Atwood

# re: Using Regex to return the first N words in a string

Why do you need the stringbuilder? Just return the first match from..

Regex.Match(s, "(\w+\s+){5}").ToString().Trim

given input of...

"this is word four five six seven eight nine ten eleven twelve thirteen!"

returns first match of..

"this is word four five"
Friday, January 07, 2005 7:51 AM by Roy Osherove

# re: Using Regex to return the first N words in a string

Jeff: Excellent idea!
that works just as well :)
Friday, January 07, 2005 9:24 AM by Aaron Robinson

# re: Using Regex to return the first N words in a string

And if Jeff's version doesn't have any matches, you'd just spit out the original string, on the assumption that it didn't have at least N words.
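
A minimal sketch of that fallback, assuming Jeff's pattern with the trailing whitespace trimmed. The class and method names are hypothetical and not from the thread.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative only.
public static class FirstWordsWithFallback
{
    public static string FindFirstWordsOrAll(string input, int howManyToFind)
    {
        // Same expression as in the post: N repetitions of word + whitespace.
        Match match = Regex.Match(input, @"(\w+\s+){" + howManyToFind + "}");

        // No match means the input did not contain N whitespace-terminated
        // words, so hand back the original string unchanged.
        return match.Success ? match.Value.Trim() : input;
    }
}
```

With this sketch, FindFirstWordsOrAll("one two three", 5) returns the whole input, while FindFirstWordsOrAll("this is word four five six", 4) returns "this is word four".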
Friday, January 07, 2005 9:25 AM by Sam Smoot

# re: Using Regex to return the first N words in a string

If you replace spaces with word boundaries then you don't have to Trim(), and punctuation won't break the function:

^(\w+\b.*?){4}

Also, this matches only the first set, so theoretically it may be faster?
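
To see how the word-boundary version behaves, here is a small sketch that wraps the pattern above in a function. The class and method names are hypothetical; only the expression comes from the comment.

```csharp
using System.Text.RegularExpressions;

// Hypothetical wrapper around the word-boundary pattern.
public static class FirstWordsByBoundary
{
    public static string FindFirstWords(string input, int howManyToFind)
    {
        // ^      anchor at the start of the input
        // \w+\b  a whole word, ending at a word boundary
        // .*?    as little as possible of whatever separates it from the next word
        string pattern = @"^(\w+\b.*?){" + howManyToFind + "}";
        return Regex.Match(input, pattern).Value;
    }
}
```

Because the lazy .*? after the last word matches nothing, the result has no trailing whitespace: the sample sentence from the post gives "this is word four", and "Hello, world! How are you?" gives "Hello, world! How are".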
Saturday, January 08, 2005 2:09 AM by Jeff Atwood

# re: Using Regex to return the first N words in a string

Oh yeah, we definitely should have used "^" so we don't get multiple matches.
Sunday, January 09, 2005 4:39 AM by Alex Lvovich

# re: Using Regex to return the first N words in a string

I have a function that checks whether a string contains a specific pattern; I do it using the string.IndexOf method. Using regex seems to be a more elegant solution for this. My question is whether using regex is also a more efficient way to solve this problem.
Friday, January 14, 2005 9:12 AM by Green Dragon

# re: Using Regex to return the first N words in a string

Hey Roy, hope you don't mind that I linked to your post in my Wiki: http://greendragon.myserver.org:1600/FlexWiki/default.aspx/InfiniteLoops.FindFirstNWordsInString
Wednesday, August 09, 2006 12:47 PM by David

# re: Using Regex to return the first N words in a string

What if you had a sentence and wanted to return just the nth word from it, like the third or fourth word?
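
One possible answer, as a rough sketch rather than anything from the original post: collect every run of word characters with Regex.Matches and pick the nth one. The class and method names are hypothetical, n is assumed to be 1-based, and an empty string is returned when the sentence has fewer than n words.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; not part of the original post or comments.
public static class NthWordFinder
{
    public static string FindNthWord(string input, int n)
    {
        // Every maximal run of word characters counts as one word.
        MatchCollection words = Regex.Matches(input, @"\w+");

        // n is 1-based; out-of-range requests yield an empty string.
        return (n >= 1 && n <= words.Count) ? words[n - 1].Value : "";
    }
}
```

For example, FindNthWord("this is word four five", 3) returns "word".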