Recipe 13.5.
Creating a Nongreedy Pattern
Problem
You're
using a regular expression
but not seeing the correct results. The pattern is acting greedy,
matching more characters than you want it to.
Solution
Replace the regex with a nongreedy version to
match the smallest amount of characters possible.
Discussion
Whenever you create a pattern that includes
matches for character repetition using the * and + metacharacters or
the {
n , m }
metasequence, the pattern is greedy. A greedy pattern is
one that tries to consume as much text as possible, matching the
largest substring it can. Patterns are greedy because of the
underlying code in the regular expression engine, and understanding
how that engine works allows you to create more precise
regexes.
Consider that you want to remove HTML tags in a string, as described in
Recipe
13.4. Every HTML tag starts with an opening < and
ends in a closing >. In between the angle brackets,
there could be a wide variety of characters, such a numbers,
letters, quotes, the equal sign, etc. For the sake of simplicity,
instead of creating a character class for everything that could
appear between the < and >, a .*
will match "any character, any number of times." So, you construct
the regular expression /<.*>/g and try running it
against a string that contains HTML tags:
var example:String = "<b>hello</b>, world!";
// Displays: <b>hello</b>
trace( example.match( /<.*>/g ) );
You might have expected that the regular
expression would produce two matches, one for <b>
and another for </b>. From the preceding code block,
you can see that only one match was produced, the entire string
<b>hello</b>. This is an example of a greedy
pattern in action.
The first character in the expression is a
<, which is taken literally. It matches in the string
at the very first position so the engine moves on in the pattern
and keeps processing. The next character in the pattern is a
., which matches everything except a newline, followed by
the *, which repeats the . zero or more times.
This is where the pattern becomes greedy.
The . matches the b because it
falls under the "any character" umbrella. After the b in
the string is a >, which again falls under "any
character" andas you guessed itis matched by the .. The
engine keeps repeating the ., matching
hello</b>, world! up until the end of the
string. At the end of the string, the . fails because
there is nothing left; however, the pattern still needs to match a
closing >. Because the engine knows that the *
repetition has been fulfilled in matching the ., the
engine backtracks and starts
looking for a match to the > to complete the pattern.
Moving backward in the string from the end, the ! is not a
match to the > required by the pattern, so the engine
moves backward yet again knowing that the .* has been
fulfilled. It keeps moving backward through the d,
l, r, o, etc. until it meets the
> character just before the comma. The angle bracket at
this location matches the closing angle bracket in the pattern, and
the engine considers the pattern to have been matched successfully,
returning a match of <b>hello</b>.
The RegExp engine wants to match a
pattern as quickly as possible. Because a match has been made at
this point, there is no need to continue to backtrack and try to
find a smaller match. Therefore, greedy patterns always return the
longest match.
There are two possible solutions to greedy
patterns. The first is to make a pattern nongreedy, sometimes
called a lazy pattern. The second is to creatively use a
character class.
You make a pattern lazy by removing greedy
repetition. Lazy patterns match the smallest possible substring.
Creating a lazy pattern is as simple as adding the ?
metacharacter to the end of the repetition expression, as described
in Table 13-5.
Table 13-5. Lazy patterns
Expression |
Lazily matches the preceding
character or group |
??
|
Zero or one time |
*?
|
Zero or more times |
+?
|
One or more times |
{n,}?
|
At least n times |
{n,m}?
|
At least n but no more
than m times |
The greedy /<.*>/g pattern used
earlier to match HTML tags can be replaced with the lazy pattern of
/<.*?>/g, giving the desired effect of matching all
of the HTML tags:
var example:String = "<b>hello</b>, world!";
// Displays: <b>,</b>
trace( example.match( /<.*?>/g ) );
By making the pattern lazy, the underlying
RegExp engine treats the pattern differently than it would a
greedy pattern. The < is matched just like before, but
this time the . is followed by a lazy * (as
signified by the ?), and the engine knows to match as few
characters as possible. At this point, the requirement of "any
character, zero or more times" has already been met since zero
characters have been consumed and the engine starts looking for the
closing >. The b in the string is encountered,
which doesn't match >. The engine backtracks just like
before, but in doing so forces the lazy * to consume more
text. The .* is expanded to match the b, and the
engine again looks to match the >. The next character
in the string is >, which matches successfully and
completes the pattern. The engine therefore returns
<b> as a match, which is exactly the desired
result.
Instead of making a lazy pattern, you can
replace the . with a character class that includes every
character except the >, resulting in the
/<[^>]*>/g pattern:
var example:String = "<b>hello</b>, world!";
// Displays: <b>,</b>
trace( example.match( /<[^>]*>/g ) );
In this case, a character class might be the
better option because it eliminates the backtracking done by the
engine. The engine processes the < as normal, and then
matches characters up until the > in the string since
it's not a part of the character class. It then moves on to the
> in the pattern and reports a successful match because
> is found in the string.
See Also
Recipes 13.1,
13.3,
and 13.4
|