Recipe 13.1.
Understanding Regular Expression Patterns
Problem
You want
to understand the basic building blocks of regular expressions.
Solution
Regular expressions are built by combining literal characters with
characters that have special meaning. Start by learning the basic
patterns, and then use this knowledge to put together more complex
patterns.
Discussion
A regular expression is a pattern
constructed using the regular expression syntax and is typically
used during text processing and pattern matching. The syntax
consists of characters, metacharacters, and
metasequences. Characters are
interpreted literally, whereas metacharacters and metasequences
have special meaning in the regular expression context. For
example, the regular expression built from the characters
hello matches the string "hello", whereas the regular
expression consisting only of the . metacharacter means
"any character" and matches "a", "b", "1", etc. Additionally, the
regular expression built from the \d metasequence
matches any digit, such as "1" or "9".
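As a minimal sketch of these three building blocks, the following snippet checks a literal pattern, the . metacharacter, and the \d metasequence against a few sample strings. It relies on the RegExp.test() method, which returns true when the pattern is found anywhere in the string; the variable names and sample strings are illustrative only.

```actionscript
// Literal characters match themselves
var literal:RegExp = /hello/;
trace( literal.test( "hello world" ) ); // true
trace( literal.test( "help" ) );        // false

// The . metacharacter matches any single character
var anyChar:RegExp = /./;
trace( anyChar.test( "a" ) ); // true
trace( anyChar.test( "1" ) ); // true

// The \d metasequence matches any digit
var digit:RegExp = /\d/;
trace( digit.test( "room 9" ) );    // true
trace( digit.test( "room nine" ) ); // false
```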
Before going deeper into the regular
expression syntax, let's look at how regular
expressions are created in ActionScript 3.0. Regular expressions
are instances of the RegExp class and can be
constructed either from a string describing the pattern or from a
regular expression literal. A
regular expression literal is a forward slash, followed by the
regular expression pattern, followed by another forward slash, such
as /pattern/. The following code
demonstrates how to create a regular expression for the pattern
hello by using both a string and the RegExp
constructor, as well as a regular expression literal:
// Create a pattern for hello using the RegExp class constructor
// passing in a string describing the pattern
var example1:RegExp = new RegExp( "hello" );
// Create the same hello pattern using a regular expression literal
var example2:RegExp = /hello/;
Both the example1 and example2
regular expressions match the same pattern, namely the string
"hello". In general, the pattern is the same regardless of which
method you use to create the regular expression. However, when a
backslash (\) is part of the regular expression pattern,
using a string and the RegExp constructor gets tricky.
Because the RegExp object is created by
passing a string to the constructor, every backslash (\)
within the string must be escaped as \\. Since
the backslash is also a special character in RegExp patterns,
matching a literal backslash with the string form requires four
of them: \\\\.
Backslashes mark the beginning of an
escape sequence inside a string (see Recipe
12.3) and lose their meaning in the regular expression context.
That is, the backslash is interpreted as a special string character
before being interpreted in the regex. Therefore, if you want to
match a pattern with a backslash, you have to use a double backslash
in the string approach. The regular expression literal does not
have the same problem:
// Create a regular expression to match a digit (note the double
// backslash)
var example1:RegExp = new RegExp( "\\d" );
// Create a regular expression to match a digit
var example2:RegExp = /\d/;
// Create a regular expression that matches a backslash.
var example3:RegExp = new RegExp("\\\\");
// Create a regular expression to match a backslash
var example4:RegExp = /\\/;
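To check that the string and literal forms really produce the same pattern, one option is to compare the source property of each RegExp, which holds the pattern text as the regex engine sees it. The following is a sketch under that assumption; the variable names and the sample path are made up for illustration.

```actionscript
// Both forms end up with the same pattern text: \d
var fromString:RegExp = new RegExp( "\\d" );
var fromLiteral:RegExp = /\d/;
trace( fromString.source == fromLiteral.source ); // true
trace( fromString.test( "7" ) );  // true
trace( fromLiteral.test( "7" ) ); // true

// Matching a literal backslash takes four backslashes in the
// string form but only two in the literal form
var slashFromString:RegExp = new RegExp( "\\\\" );
var slashFromLiteral:RegExp = /\\/;
var path:String = "C:\\temp"; // the string value is C:\temp
trace( slashFromString.test( path ) );      // true
trace( slashFromLiteral.test( path ) );     // true
trace( slashFromString.test( "C:/temp" ) ); // false
```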
The preferred way to create regular expressions
is by using regular expression literals, and this convention is used
throughout the rest of this chapter.
By now you know that characters in a regular
expression pattern are interpreted literally. By combining
metacharacters and metasequences with regular characters, you can
create powerful combinations useful for matching many pattern
types. Let's take a look at the metacharacters, what they mean, and
how they might be used.
Table 13-1 summarizes
the regular expression metacharacters. Any time you want to use one
of these metacharacters literally, it must be preceded by a
backslash. For example, to match an open curly brace, use the
regular expression \{.
Table 13-1. Regular expression metacharacters
Expression | Meaning | Example
? | Matches the preceding character zero or one time (i.e., the preceding character is optional) | ta?k matches tak or tk but not tik or taak
* | Matches the preceding character zero or more times | wo*k matches wok, wk, or woook, but not wak
+ | Matches the preceding character one or more times | craw+l matches crawl or crawwwl but not cral
. (period) | Matches any one character except newline (unless the dotall flag is set) | c.ow matches crow or clow but not cow
^ | Matches the start of the string (also matches the start of a line when the multiline flag is set) | ^wap matches wap but not swap
$ | Matches the end of the string (also matches the position before a newline "\n" when the multiline flag is set) | ow$ matches ow but not owl
| (pipe) | Matches either the left or right side of the pipe | one|two matches one or two but not ten
\ | Escapes the special meaning of the metacharacter following the backslash | \. matches a period, instead of "any one character" like the metacharacter . would
( and ) | Creates groups within the regular expression to: define the scope of |; define the scope of { and }; use back references, where \1 refers to whatever is matched in the first group, etc. | l(o|a)g matches log or lag but not lug; a(b){1,2} matches ab or abb but not a; (a|b)\1 matches aa or bb but not ab or ba
[ and ] | Defines character classes that represent matches for a single character. Presence of a - indicates a range of characters. A caret (^) at the beginning negates the character class (everything except what is defined by the class matches). Metacharacters do not need to be escaped with a backslash (but a dash and beginning caret do) | l[oa]g matches log or lag but not lug; [a-z] matches any lowercase character such as a or h but not 1, 2, or F; l[^oa]g matches lug but not lag or log; [+\-] matches + or -
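Several of the examples from Table 13-1 can be confirmed with a short sketch like the one below. It uses RegExp.test(), which reports whether the pattern occurs anywhere in the string; the variable names are illustrative and not part of any API.

```actionscript
// ? makes the preceding character optional
var optional:RegExp = /ta?k/;
trace( optional.test( "tak" ) ); // true
trace( optional.test( "tk" ) );  // true
trace( optional.test( "tik" ) ); // false

// ^ anchors the match to the start of the string
var anchored:RegExp = /^wap/;
trace( anchored.test( "wap" ) );  // true
trace( anchored.test( "swap" ) ); // false

// \ removes the special meaning of the period
var literalDot:RegExp = /\./;
trace( literalDot.test( "a.b" ) ); // true
trace( literalDot.test( "ab" ) );  // false

// Grouping limits the scope of the pipe
var alternation:RegExp = /l(o|a)g/;
trace( alternation.test( "lag" ) ); // true
trace( alternation.test( "lug" ) ); // false

// \1 refers back to whatever the first group matched
var backReference:RegExp = /(a|b)\1/;
trace( backReference.test( "bb" ) ); // true
trace( backReference.test( "ab" ) ); // false

// A leading caret negates the character class
var negatedClass:RegExp = /l[^oa]g/;
trace( negatedClass.test( "lug" ) ); // true
trace( negatedClass.test( "log" ) ); // false
```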
Table 13-2 describes the metasequences in the same way,
listing what each expression matches along with an
example.
Table 13-2. Regular expression metasequences
Expression | Matches | Example
{n} | Exactly n occurrences of the preceding character or group | cre{2}l matches creel but not crel or creeel
{n,} | At least n occurrences of the preceding character or group | cre{2,}l matches creel or creeeel but not crel
{n,m} | At least n but no more than m occurrences of the preceding character or group | cre{2,3}l matches creel or creeel but not crel or creeeel
\A | At the start of the string; similar to ^ | \Awap matches wap but not swap
\b | Word boundary | \b7\b matches 7 but not 71 or 573
\B | Non-word boundary | \B7\B matches 573 but not 71 or 7 or 37
\d | Any numeric digit; same as [0-9] | a\d matches a1 and a8 but not ab or ad
\D | Any non-digit character; same as [^0-9] | a\D matches aB and ak, but not a8 or a1
\n | The newline character | a\nb matches "a\nb"
\r | The return character | a\rb matches "a\rb"
\s | Single whitespace character (space, tab, line feed, or form feed) | King\sTut matches King Tut and King\tTut
\S | Single nonwhitespace character | \STut matches gTut but not Tut
\t | The tab character | a\tb matches "a\tb"
\unnnn | The Unicode character specified by the hex digits nnnn | \u000a matches "\n"
\w | Any word character; same as [A-Za-z0-9_] | a\wm matches arm and a8m, but not a m or aém
\W | Any non-word character; same as [^A-Za-z0-9_] | a\Wm matches a m or aém, but not a7m or aim
\xnn | The ASCII character specified by the hex digits nn | \x0a matches "\n"
\Z | The end of the string; matches before the line break if the string ends in one | ab\Z matches "ab\n" and ab, but not "ab\nc"
\z | The end of the string; matches after the line break if the string ends in one | ab\z matches ab, but not "ab\n" or "ab\nc"
Table 13-1 and
Table 13-2 describe the
basic syntax rules that make up regular expressions. By combining
characters, metacharacters, and metasequences, you can match a wide
variety of patterns. There is more to the story, however.
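Before moving on, here is a short sketch that exercises a few of the metasequences from Table 13-2. It uses RegExp.test(), which returns true when the pattern is found anywhere in the string; the variable names are illustrative only.

```actionscript
// {n}: exactly two e characters between cr and l
var exactlyTwo:RegExp = /cre{2}l/;
trace( exactlyTwo.test( "creel" ) );  // true
trace( exactlyTwo.test( "creeel" ) ); // false

// {n,m}: two or three e characters
var twoOrThree:RegExp = /cre{2,3}l/;
trace( twoOrThree.test( "creeel" ) );  // true
trace( twoOrThree.test( "creeeel" ) ); // false

// \b: a 7 with a word boundary on both sides
var loneSeven:RegExp = /\b7\b/;
trace( loneSeven.test( "7" ) );  // true
trace( loneSeven.test( "71" ) ); // false

// \d: the letter a followed by any digit
var letterThenDigit:RegExp = /a\d/;
trace( letterThenDigit.test( "a8" ) ); // true
trace( letterThenDigit.test( "ab" ) ); // false

// \s: a space or a tab both count as whitespace
var kingTut:RegExp = /King\sTut/;
trace( kingTut.test( "King Tut" ) );  // true
trace( kingTut.test( "King\tTut" ) ); // true
trace( kingTut.test( "KingTut" ) );   // false
```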
Regular expressions can also include certain
flags that indicate if any special processing should be done with
the pattern. There are five flags that can be accessed as
properties of a RegExp object: global,
ignoreCase, multiline, dotall, and
extended.
The flags must be set when the expression is
created; trying to modify a flag on a RegExp instance
results in a compile-time error:
// Generates a compile-time error in strict mode:
// Property is read-only
example.global = true;
There are two ways to set flags, depending on
which method is used to create the regex. When using the
RegExp constructor, you can pass a second string parameter
that lists the flags for the regex. When using a regular expression
literal, the flags should follow the trailing forward slash that
ends the expression:
// Create a regular expression with the global and ignoreCase flags
var example1:RegExp = new RegExp( "hello", "gi" );
// Create a regular expression with the global and ignoreCase flags
var example2:RegExp = /hello/gi;
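The flag properties can be read back at any time, which gives a quick way to confirm what a regex was created with. Because the flags are read-only, one possible workaround for "changing" a flag is to build a new RegExp from the original's source property, as sketched here. The variable names are illustrative.

```actionscript
var example:RegExp = /hello/gi;
trace( example.global );     // true
trace( example.ignoreCase ); // true
trace( example.multiline );  // false

// Reuse the pattern text, but declare only the ignoreCase flag
var ignoreCaseOnly:RegExp = new RegExp( example.source, "i" );
trace( ignoreCaseOnly.global );          // false
trace( ignoreCaseOnly.ignoreCase );      // true
trace( ignoreCaseOnly.test( "Hello" ) ); // true
```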
By default, all the flags are set to
false unless they are explicitly declared when the regex
is created. Table 13-3 lists the
various flags and their meaning.
Table 13-3. Regular expression flags
Flag | Meaning | Example
g (global) | Finds every match instead of stopping at the first one | /the/g matches every occurrence of the
i (ignoreCase) | Performs a case-insensitive match for [a-z] and [A-Z] (and not special characters like é) | /a/i matches a and A
m (multiline) | Allows ^ to match the start of a line; allows $ to match the end of a line | /^a/m matches the a in both \na and a
s (dotall) | Allows . to match the newline character \n | /a./s matches both a\n and ab
x (extended) | Allows whitespace in the regex that is ignored by the pattern, allowing the regex to be written more clearly | /a \d/x matches a2 but not a 2 (with a space between the characters)
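The effect of each flag in Table 13-3 can be seen by comparing a pattern with and without it. The sketch below does this with String.match(), which returns an array of every match when the global flag is set, and with RegExp.test(). The sample strings and variable names are made up for illustration.

```actionscript
var sentence:String = "The cat and the hat";

// g alone is case-sensitive, so only the lowercase "the" is found
trace( sentence.match( /the/g ).length );  // 1
// g and i together find both "The" and "the"
trace( sentence.match( /the/gi ).length ); // 2

// m: ^ also matches at the start of each line
var twoLines:String = "b\na";
var startOnly:RegExp = /^a/;
var anyLineStart:RegExp = /^a/m;
trace( startOnly.test( twoLines ) );    // false
trace( anyLineStart.test( twoLines ) ); // true

// s: . also matches the newline character
var dotDefault:RegExp = /a.b/;
var dotAll:RegExp = /a.b/s;
trace( dotDefault.test( "a\nb" ) ); // false
trace( dotAll.test( "a\nb" ) );     // true

// x: whitespace inside the pattern is ignored
var spaced:RegExp = /a \d/x;
trace( spaced.test( "a2" ) );  // true
trace( spaced.test( "a 2" ) ); // false
```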
The most commonly used flags are
ignoreCase and global, but specifying the
extended flag can help in understanding regexes. With the
extended flag set, you can insert extra whitespace to
highlight the different parts that make up the expression; for
example:
var example1:RegExp = /(a(b)*){2,}/;
// Use the extended flag for slightly more readability
var example2:RegExp = /(a (b)* ){2,}/x;
The preceding code creates a regular expression
for "a, followed by b any number of times, with the whole
expression repeated at least 2 times" and matches "abba" and
"abbbabbbbbbbb", but not "abbb".
A key point to remember is that every regex can
be reduced to these fundamental building blocks. Understanding this
and learning how to break down complex regex patterns will help
avoid some of the frustration associated with learning regular
expressions. It's worth your time to learn regular expressions, and
once you've got them down, they'll prove to be a valuable tool to
have on your belt.
See Also
Recipes 9.5
and 12.3.
A good reference for regular expressions can be found at http://www.regular-expressions.info.
See Mastering Regular
Expressions, by Jeffrey Friedl (O'Reilly) for extensive
practice with regular expressions and Regular Expressions Pocket Reference, by Tony
Stubblebine (O'Reilly) for a
quick lookup guide.