Recipe 13.5. Creating a Nongreedy Pattern

Problem

You're using a regular expression but not seeing the correct results. The pattern is acting greedy, matching more characters than you want it to.

Solution

Replace the regex with a nongreedy version to match the smallest amount of characters possible.

Discussion

Whenever you create a pattern that includes matches for character repetition using the * and + metacharacters or the { n , m } metasequence, the pattern is greedy. A greedy pattern is one that tries to consume as much text as possible, matching the largest substring it can. Patterns are greedy because of the underlying code in the regular expression engine, and understanding how that engine works allows you to create more precise regexes.

Consider that you want to remove HTML tags in a string, as described in Recipe 13.4. Every HTML tag starts with an opening < and ends in a closing >. In between the angle brackets, there could be a wide variety of characters, such a numbers, letters, quotes, the equal sign, etc. For the sake of simplicity, instead of creating a character class for everything that could appear between the < and >, a .* will match "any character, any number of times." So, you construct the regular expression /<.*>/g and try running it against a string that contains HTML tags:

var example:String = "<b>hello</b>, world!";

// Displays: <b>hello</b>
trace( example.match( /<.*>/g ) );

You might have expected that the regular expression would produce two matches, one for  and another for . From the preceding code block, you can see that only one match was produced, the entire string hello. This is an example of a greedy pattern in action.

The first character in the expression is a <, which is taken literally. It matches in the string at the very first position so the engine moves on in the pattern and keeps processing. The next character in the pattern is a ., which matches everything except a newline, followed by the *, which repeats the . zero or more times. This is where the pattern becomes greedy.

The . matches the b because it falls under the "any character" umbrella. After the b in the string is a >, which again falls under "any character" andas you guessed itis matched by the .. The engine keeps repeating the ., matching hello, world! up until the end of the string. At the end of the string, the . fails because there is nothing left; however, the pattern still needs to match a closing >. Because the engine knows that the * repetition has been fulfilled in matching the ., the engine backtracks and starts looking for a match to the > to complete the pattern. Moving backward in the string from the end, the ! is not a match to the > required by the pattern, so the engine moves backward yet again knowing that the .* has been fulfilled. It keeps moving backward through the d, l, r, o, etc. until it meets the > character just before the comma. The angle bracket at this location matches the closing angle bracket in the pattern, and the engine considers the pattern to have been matched successfully, returning a match of hello.

The RegExp engine wants to match a pattern as quickly as possible. Because a match has been made at this point, there is no need to continue to backtrack and try to find a smaller match. Therefore, greedy patterns always return the longest match.

There are two possible solutions to greedy patterns. The first is to make a pattern nongreedy, sometimes called a lazy pattern. The second is to creatively use a character class.

You make a pattern lazy by removing greedy repetition. Lazy patterns match the smallest possible substring. Creating a lazy pattern is as simple as adding the ? metacharacter to the end of the repetition expression, as described in Table 13-5.

Table 13-5. Lazy patterns

Expression	Lazily matches the preceding character or group
??	Zero or one time
*?	Zero or more times
+?	One or more times
{`n`,}?	At least `n` times
{`n`,`m`}?	At least `n` but no more than `m` times

The greedy /<.*>/g pattern used earlier to match HTML tags can be replaced with the lazy pattern of /<.*?>/g, giving the desired effect of matching all of the HTML tags:

var example:String = "<b>hello</b>, world!";

// Displays: <b>,</b>
trace( example.match( /<.*?>/g ) );

By making the pattern lazy, the underlying RegExp engine treats the pattern differently than it would a greedy pattern. The < is matched just like before, but this time the . is followed by a lazy * (as signified by the ?), and the engine knows to match as few characters as possible. At this point, the requirement of "any character, zero or more times" has already been met since zero characters have been consumed and the engine starts looking for the closing >. The b in the string is encountered, which doesn't match >. The engine backtracks just like before, but in doing so forces the lazy * to consume more text. The .* is expanded to match the b, and the engine again looks to match the >. The next character in the string is >, which matches successfully and completes the pattern. The engine therefore returns  as a match, which is exactly the desired result.

Instead of making a lazy pattern, you can replace the . with a character class that includes every character except the >, resulting in the /<[^>]*>/g pattern:

var example:String = "<b>hello</b>, world!";

// Displays: <b>,</b>
trace( example.match( /<[^>]*>/g ) );

In this case, a character class might be the better option because it eliminates the backtracking done by the engine. The engine processes the < as normal, and then matches characters up until the > in the string since it's not a part of the character class. It then moves on to the > in the pattern and reports a successful match because > is found in the string.