Pattern Syntax

Pattern Syntax -- Describes PCRE regex syntax

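Description

The PCRE library is a set of functions that implement regular expression pattern matching using the same syntax and semantics as Perl 5, with just a few differences (see below). The current implementation corresponds to Perl 5.005.

Differences From Perl

The differences described here are with respect to Perl 5.005.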

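  1. By default, a whitespace character is any character that the C library function isspace() recognizes, though it is possible to compile PCRE with alternative character type tables. Normally isspace() matches space, formfeed, newline, carriage return, horizontal tab, and vertical tab. Perl 5 no longer includes vertical tab in its set of whitespace characters. The \v escape that was in the Perl documentation for a long time was never in fact recognized. However, the character itself was treated as whitespace at least up to 5.002. In 5.004 and 5.005 it does not match \s.

  2. PCRE does not allow repeat quantifiers on lookahead assertions. Perl permits them, but they do not mean what you might think. For example, (?!a){3} does not assert that the next three characters are not "a". It just asserts three times that the next character is not "a".

  3. Capturing subpatterns that occur inside negative lookahead assertions are counted, but their entries in the offsets vector are never set. Perl sets its numerical variables from any such patterns that are matched before the assertion fails to match, but only if the negative lookahead assertion contains just one branch.

  4. Though binary zero characters are supported in the subject string, they are not allowed in a pattern string, because it is passed as a normal C string terminated by a binary zero. The escape sequence "\x00" can be used in the pattern to represent a binary zero.

  5. The following Perl escape sequences are not supported: \l, \u, \L, \U. In fact these are implemented by Perl's general string handling and are not part of its pattern matching engine.

  6. The Perl \G assertion is not supported, as it is not relevant to single pattern matches.

  7. Fairly obviously, PCRE does not support the (?{code}) construction.

  8. There are some oddities in Perl 5.005_02 concerning the settings of captured strings when part of a pattern is repeated. For example, matching "aba" against the pattern /^(a(b)?)+$/ sets $2 to the value "b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves $2 unset. However, if the pattern is changed to /^(aa(b(b))?)+$/ then $2 (and $3) get set. In Perl 5.004 $2 is set in both cases, and that is also true of PCRE. If Perl changes in the future, PCRE may change to follow.

  9. Another as yet unresolved discrepancy is that in Perl 5.005_02 the pattern /^(a)?(?(1)a|b)+$/ matches the string "a", whereas in PCRE it does not. However, in both Perl and PCRE, /^(a)?a/ matched against "a" leaves $1 unset.

  10. PCRE provides some extensions to the Perl regular expression facilities:

    1. Although lookbehind assertions must match fixed length strings, each alternative branch of a lookbehind assertion can match a different length of string. Perl 5.005 requires them all to have the same length.

    2. If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $ meta-character matches only at the very end of the string.

    3. If PCRE_EXTRA is set, a backslash followed by a letter with no special meaning is faulted.

    4. If PCRE_UNGREEDY is set, the greediness of the repetition quantifiers is inverted, that is, by default they are not greedy, but if followed by a question mark they are.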

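Regular Expression Details

Introduction

The syntax and semantics of the regular expressions supported by PCRE are described below. Regular expressions are also described in the Perl documentation and in a number of other books, some of which have copious examples. Jeffrey Friedl's "Mastering Regular Expressions", published by O'Reilly (ISBN 1-56592-257-3), covers them in great detail. The description here is intended as reference documentation.

A regular expression is a pattern that is matched against a subject string from left to right. Most characters stand for themselves in a pattern, and match the corresponding characters in the subject. As a trivial example, the pattern The quick brown fox matches a portion of a subject string that is identical to itself.

Meta-characters

The power of regular expressions comes from the ability to include alternatives and repetitions in the pattern. These are encoded in the pattern by the use of meta-characters, which do not stand for themselves but instead are interpreted in some special way.

There are two different sets of meta-characters: those that are recognized anywhere in the pattern except within square brackets, and those that are recognized in square brackets. Outside square brackets, the meta-characters are as follows: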

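\

general escape character with several uses

^

assert start of subject (or line, in multiline mode, i.e. immediately after a newline)

$

assert end of subject (or line, in multiline mode, i.e. immediately before a newline)

.

match any character except newline (by default)

[

start character class definition

]

end character class definition

|

start of alternative branch

(

start subpattern

)

end subpattern

?

extends the meaning of (, also 0 or 1 quantifier, also quantifier minimizer

*

0 or more quantifier

+

1 or more quantifier

{

start min/max quantifier

}

end min/max quantifier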

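Part of a pattern that is in square brackets is called a "character class". In a character class the only meta-characters are:

\

general escape character

^

negate the class, but only if it is the first character

-

indicates character range

]

terminates the character class

The following sections describe the use of each of the meta-characters.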

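Backslash (\)

The backslash character has several uses. Firstly, if it is followed by a non-alphanumeric character, it takes away any special meaning that character may have. This use of backslash as an escape character applies both inside and outside character classes.

For example, if you want to match a "*" character, you write "\*" in the pattern. This applies whether or not the following character would otherwise be interpreted as a meta-character, so it is always safe to precede a non-alphanumeric character with "\" to specify that it stands for itself. In particular, if you want to match a backslash, you write "\\".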
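The escaping rule can be tried out with a short sketch. Python's re module is used here only as a stand-in for PCRE (an assumption: it is a different engine, but it treats these two escapes the same way), and the helper names are made up for illustration.

```python
import re

# "\*" removes the special meaning of "*", so this matches a literal star.
STAR = re.compile(r"\*")
# "\\" in the pattern matches one literal backslash in the subject.
BACKSLASH = re.compile(r"\\")

def has_star(text):
    """Return True if text contains a literal '*' character."""
    return STAR.search(text) is not None

def has_backslash(text):
    """Return True if text contains a literal backslash character."""
    return BACKSLASH.search(text) is not None
```

Raw string literals (the r prefix) keep the host language from consuming the backslashes before the regex engine sees them.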

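Note: Backslash has a special meaning in single and double quoted PHP strings. Thus if \ has to be matched with the regular expression \\, then "\\\\" or '\\\\' must be used in PHP code.

If a pattern is compiled with the PCRE_EXTENDED option, whitespace in the pattern (other than in a character class) and characters between a "#" outside a character class and the next newline character are ignored. An escaping backslash can be used to include a whitespace or "#" character as part of the pattern.

A second use of backslash provides a way of encoding non-printing characters in patterns in a visible manner. There is no restriction on the appearance of non-printing characters, apart from the binary zero that terminates a pattern, but when a pattern is being prepared by text editing, it is usually easier to use one of the following escape sequences than the binary character it represents: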

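\a

alarm, that is, the BEL character (hex 07)

\cx

"control-x", where x is any character

\e

escape (hex 1B)

\f

formfeed (hex 0C)

\n

newline (hex 0A)

\r

carriage return (hex 0D)

\t

tab (hex 09)

\xhh

character with hex code hh

\ddd

character with octal code ddd, or backreference

The precise effect of "\cx" is as follows: if "x" is a lower case letter, it is converted to upper case. Then bit 6 of the character (hex 40) is inverted. Thus "\cz" becomes hex 1A, but "\c{" becomes hex 3B, while "\c;" becomes hex 7B.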
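The arithmetic behind "\cx" can be checked with a few lines of Python. This is only a sketch of the rule as stated above, not PCRE's own code, and the function name control_char_code is hypothetical.

```python
def control_char_code(x):
    """Return the character code that "\\cx" stands for, per the rule above."""
    code = ord(x.upper())  # a lower case letter is converted to upper case first
    return code ^ 0x40     # then bit 6 (hex 40) is inverted

# The three cases quoted in the text.
CZ = control_char_code("z")
C_BRACE = control_char_code("{")
C_SEMICOLON = control_char_code(";")
```

Characters that are not letters are left unchanged by the upper case step, which is why "{" (hex 7B) maps down to hex 3B while ";" (hex 3B) maps up to hex 7B.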

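After "\x", up to two hexadecimal digits are read (letters can be in upper or lower case). In UTF-8 mode, "\x{...}" is allowed, where the contents of the braces is a string of hexadecimal digits. The original hexadecimal escape sequence, \xhh, matches a two-byte UTF-8 character if the value is greater than 127.

After "\0" up to two further octal digits are read. In both cases, if there are fewer than two digits, just those that are present are used. Thus the sequence "\0\x\07" specifies two binary zeros followed by a BEL character. Make sure you supply two digits after the initial zero if the character that follows is itself an octal digit.

The handling of a backslash followed by a digit other than 0 is complicated. Outside a character class, PCRE reads it and any following digits as a decimal number. If the number is less than 10, or if there have been at least that many previous capturing left parentheses in the expression, the entire sequence is taken as a back reference. A description of how this works is given later, following the discussion of parenthesized subpatterns.

Inside a character class, or if the decimal number is greater than 9 and there have not been that many capturing subpatterns, PCRE re-reads up to three octal digits following the backslash, and generates a single byte from the least significant 8 bits of the value. Any subsequent digits stand for themselves. For example: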

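\040

is another way of writing a space

\40

is the same, provided there are fewer than 40 previous capturing subpatterns

\7

is always a back reference

\11

might be a back reference, or another way of writing a tab

\011

is always a tab

\0113

is a tab followed by the character "3"

\113

is the character with octal code 113 (since there can be no more than 99 back references)

\377

is a byte consisting entirely of 1 bits

\81

is either a back reference, or a binary zero followed by the two characters "8" and "1"

Note that octal values of 100 or greater must not be introduced by a leading zero, because no more than three octal digits are ever read.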
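A few of the unambiguous octal cases above can be exercised as follows. Python's re module stands in for PCRE here, which is an assumption: its digit-escape rules are similar but not identical, so only patterns that start with a zero or have exactly three octal digits are used. The helper name full is illustrative.

```python
import re

def full(pattern, subject):
    """Return True if the whole subject matches the pattern."""
    return re.fullmatch(pattern, subject) is not None

SPACE_OK = full(r"\040", " ")       # octal 40 is the space character
TAB_OK = full(r"\011", "\t")        # octal 11 is the tab character
TAB_THEN_3 = full(r"\0113", "\t3")  # only "\011" is read; "3" stands for itself
LETTER_K = full(r"\113", "K")       # octal 113 is decimal 75, the letter "K"
```

The "\0113" case shows why a value of octal 100 or more must not be written with a leading zero: the fourth digit is never part of the escape.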

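All the sequences that define a single byte value can be used both inside and outside character classes. In addition, inside a character class, the sequence "\b" is interpreted as the backspace character (hex 08). Outside a character class it has a different meaning (see below).

The third use of backslash is for specifying generic character types: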

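\d

any decimal digit

\D

any character that is not a decimal digit

\s

any whitespace character

\S

any character that is not a whitespace character

\w

any "word" character

\W

any "non-word" character

Each pair of escape sequences partitions the complete set of characters into two disjoint sets. Any given character matches one, and only one, of each pair.

A "word" character is any letter or digit or the underscore character, that is, any character which can be part of a Perl "word". The definition of letters and digits is controlled by PCRE's character tables, and may vary if locale-specific matching is taking place (see "Locale support" above). For example, in the "fr" (French) locale, some character codes greater than 128 are used for accented letters, and these are matched by \w.

These character type sequences can appear both inside and outside character classes. They each match one character of the appropriate type. If the current matching point is at the end of the subject string, all of them fail, since there is no character to match.

The fourth use of backslash is for certain simple assertions. An assertion specifies a condition that has to be met at a particular point in a match, without consuming any characters from the subject string. The use of subpatterns for more complicated assertions is described below. The backslashed assertions are: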

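\b

word boundary

\B

not a word boundary

\A

start of subject (independent of multiline mode)

\Z

end of subject or newline at end (independent of multiline mode)

\z

end of subject (independent of multiline mode)

\G

first matching position in subject

These assertions may not appear in character classes (but note that "\b" has a different meaning, namely the backspace character, inside a character class).

A word boundary is a position in the subject string where the current character and the previous character do not both match \w or \W (i.e. one matches \w and the other matches \W), or the start or end of the string if the first or last character matches \w, respectively.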
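The word boundary definition can be illustrated with a small sketch. Python's re module is used as a stand-in for PCRE (an assumption; for ASCII subjects its \b follows the same definition), and whole_word_positions is a hypothetical helper.

```python
import re

def whole_word_positions(word, subject):
    """Return the start offsets where word occurs as a whole word."""
    # \b on each side requires a \w / \W transition (or the string edge).
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    return [m.start() for m in pattern.finditer(subject)]

POSITIONS = whole_word_positions("cat", "cat concat cat.")
```

The occurrence inside "concat" is skipped because the character before it also matches \w, so there is no boundary at that position.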

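The \A, \Z, and \z assertions differ from the traditional circumflex and dollar (described below) in that they only ever match at the very start and end of the subject string, whatever options are set. They are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options. The difference between \Z and \z is that \Z matches before a newline that is the last character of the string as well as at the end of the string, whereas \z matches only at the end.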

The \G assertion is true only when the current matching position is at the start point of the match, as specified by the offset argument of preg_match(). It differs from \A when the value of offset is non-zero. It is available since PHP 4.3.3.

\Q and \E can be used to ignore regexp metacharacters in the pattern since PHP 4.3.3. For example: \w+\Q.$.\E$ will match one or more word characters, followed by literals .$. and anchored at the end of the string.

Unicode character properties

Since PHP 4.4.0 and 5.1.0, three additional escape sequences to match generic character types are available when UTF-8 mode is selected. They are:

\p{xx}

a character with the xx property

\P{xx}

a character without the xx property

\X

an extended Unicode sequence

The property names represented by xx above are limited to the Unicode general category properties. Each character has exactly one such property, specified by a two-letter abbreviation. For compatibility with Perl, negation can be specified by including a circumflex between the opening brace and the property name. For example, \p{^Lu} is the same as \P{Lu}.

If only one letter is specified with \p or \P, it includes all the properties that start with that letter. In this case, in the absence of negation, the curly brackets in the escape sequence are optional; these two examples have the same effect:


      \p{L}
      \pL
     

Table 1. Supported property codes

C     Other
Cc    Control
Cf    Format
Cn    Unassigned
Co    Private use
Cs    Surrogate
L     Letter
Ll    Lower case letter
Lm    Modifier letter
Lo    Other letter
Lt    Title case letter
Lu    Upper case letter
M     Mark
Mc    Spacing mark
Me    Enclosing mark
Mn    Non-spacing mark
N     Number
Nd    Decimal number
Nl    Letter number
No    Other number
P     Punctuation
Pc    Connector punctuation
Pd    Dash punctuation
Pe    Close punctuation
Pf    Final punctuation
Pi    Initial punctuation
Po    Other punctuation
Ps    Open punctuation
S     Symbol
Sc    Currency symbol
Sk    Modifier symbol
Sm    Mathematical symbol
So    Other symbol
Z     Separator
Zl    Line separator
Zp    Paragraph separator
Zs    Space separator

Extended properties such as "Greek" or "InMusicalSymbols" are not supported by PCRE.

Specifying caseless matching does not affect these escape sequences. For example, \p{Lu} always matches only upper case letters.

The \X escape matches any number of Unicode characters that form an extended Unicode sequence. \X is equivalent to (?>\PM\pM*).

That is, it matches a character without the "mark" property, followed by zero or more characters with the "mark" property, and treats the sequence as an atomic group (see below). Characters with the "mark" property are typically accents that affect the preceding character.

Matching characters by Unicode property is not fast, because PCRE has to search a structure that contains data for over fifteen thousand characters. That is why the traditional escape sequences such as \d and \w do not use Unicode properties in PCRE.

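Circumflex (^) and Dollar ($)

Outside a character class, in the default matching mode, the circumflex character is an assertion which is true only if the current matching point is at the start of the subject string. Inside a character class, circumflex has an entirely different meaning (see below).

Circumflex need not be the first character of the pattern if a number of alternatives are involved, but it should be the first thing in each alternative in which it appears if the pattern is ever to match that branch. If all possible alternatives start with a circumflex, that is, if the pattern is constrained to match only at the start of the subject, it is said to be an "anchored" pattern. (There are also other constructs that can cause a pattern to be anchored.)

A dollar character is an assertion which is TRUE only if the current matching point is at the end of the subject string, or immediately before a newline character that is the last character in the string (by default). Dollar need not be the last character of the pattern if a number of alternatives are involved, but it should be the last item in any branch in which it appears. Dollar has no special meaning in a character class.

The meaning of dollar can be changed so that it matches only at the very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at compile or matching time. This does not affect the \Z assertion.

The meanings of the circumflex and dollar characters are changed if the PCRE_MULTILINE option is set. When this is the case, they match immediately after and immediately before an internal "\n" character, respectively, in addition to matching at the start and end of the subject string. For example, the pattern /^abc$/ matches the subject string "def\nabc" in multiline mode, but not otherwise. Consequently, patterns that are anchored in single line mode because all branches start with "^" are not anchored in multiline mode. The PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is set.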
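The /^abc$/ example can be reproduced in a short sketch. Python's re module is used as a stand-in for PCRE, with re.MULTILINE playing the role of PCRE_MULTILINE (an assumption; the two engines agree on the default meaning of ^ and $ shown here).

```python
import re

subject = "def\nabc"

# Default mode: "^" matches only at the start of the subject, so there is no match.
default_match = re.search(r"^abc$", subject)

# Multiline mode: "^" also matches immediately after the internal newline.
multiline_match = re.search(r"^abc$", subject, re.MULTILINE)

# Default mode: "$" also matches just before a newline that ends the string.
before_final_newline = re.search(r"abc$", "abc\n")
```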

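Note that the sequences \A, \Z, and \z can be used to match the start and end of the subject in both modes, and if all branches of a pattern start with \A it is always anchored, whether PCRE_MULTILINE is set or not.

Full stop (.)

Outside a character class, a dot in the pattern matches any one character in the subject, including a non-printing character, but not (by default) newline. If the PCRE_DOTALL option is set, then dots match newlines as well. The handling of dot is entirely independent of the handling of circumflex and dollar, the only relationship being that they both involve newline characters. Dot has no special meaning in a character class.

\C can be used to match a single byte. This makes sense in UTF-8 mode, where the full stop matches a whole character, which can consist of multiple bytes.

Square brackets ([])

An opening square bracket introduces a character class, terminated by a closing square bracket. A closing square bracket on its own is not special. If a closing square bracket is required as a member of the class, it should be the first data character in the class (after an initial circumflex, if present) or escaped with a backslash.

A character class matches a single character in the subject; the character must be in the set of characters defined by the class, unless the first character in the class is a circumflex, in which case the subject character must not be in the set defined by the class. If a circumflex is actually required as a member of the class, ensure it is not the first character, or escape it with a backslash.

For example, the character class [aeiou] matches any lower case vowel, while [^aeiou] matches any character that is not a lower case vowel. Note that a circumflex is just a convenient notation for specifying the characters which are in the class by enumerating those that are not. It is not an assertion: it still consumes a character from the subject string, and fails if the current pointer is at the end of the string.

When caseless matching is set, any letters in a class represent both their upper case and lower case versions, so for example, a caseless [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not match "A", whereas a caseful version would.

The newline character is never treated in any special way in character classes, whatever the setting of the PCRE_DOTALL or PCRE_MULTILINE options is. A class such as [^a] will always match a newline.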
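The behaviour of negated and caseless classes can be checked with a minimal sketch. Python's re module stands in for PCRE (an assumption; re.IGNORECASE plays the role of PCRE_CASELESS), and one_char is a hypothetical helper.

```python
import re

def one_char(pattern, subject, flags=0):
    """Return True if the class pattern matches the whole subject."""
    return re.fullmatch(pattern, subject, flags) is not None

CASELESS_VOWEL = one_char(r"[aeiou]", "A", re.IGNORECASE)       # "A" counts as "a"
CASELESS_NEGATED = one_char(r"[^aeiou]", "A", re.IGNORECASE)    # so the negation rejects it
CASEFUL_NEGATED = one_char(r"[^aeiou]", "A")                    # caseful: "A" is not in the set
NEWLINE_IN_NEGATED = one_char(r"[^a]", "\n")                    # newline is an ordinary character
NEGATED_AT_END = one_char(r"[^aeiou]", "")                      # no character left to consume
```

The last case shows that a negated class is not an assertion: it needs a character to consume and fails on an empty subject.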

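The minus (hyphen) character can be used to specify a range of characters in a character class. For example, [d-m] matches any letter between d and m, inclusive. If a minus character is required in a class, it must be escaped with a backslash or appear in a position where it cannot be interpreted as indicating a range, typically as the first or last character in the class.

It is not possible to have the literal character "]" as the end character of a range. A pattern such as [W-]46] is interpreted as a class of two characters ("W" and "-") followed by the literal string "46]", so it would match "W46]" or "-46]". However, if the "]" is escaped with a backslash it is interpreted as the end of a range, so [W-\]46] is interpreted as a single class containing a range followed by two separate characters. The octal or hexadecimal representation of "]" can also be used to end a range.

Ranges operate in ASCII collating sequence. They can also be used for characters specified numerically, for example [\000-\037]. If a range that includes letters is used when caseless matching is set, it matches the letters in either case. For example, [W-c] is equivalent to [][\^_`wxyzabc], matched caselessly, and if character tables for the "fr" locale are in use, [\xc8-\xcb] matches accented E characters in both cases.

The character types \d, \D, \s, \S, \w, and \W may also appear in a character class, and add the characters that they match to the class. For example, [\dABCDEF] matches any hexadecimal digit. A circumflex can conveniently be used with the upper case character types to specify a more restricted set of characters than the matching lower case type. For example, the class [^\W_] matches any letter or digit, but not underscore.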
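Several of the class examples above can be checked directly. Python's re module is used as a stand-in for PCRE (an assumption; it parses these particular classes the same way), and the helper name full is illustrative.

```python
import re

def full(pattern, subject):
    """Return True if the whole subject matches the pattern."""
    return re.fullmatch(pattern, subject) is not None

# [W-] is a two-character class; "46]" is a literal string after it.
W_CASE = full(r"[W-]46]", "W46]")
HYPHEN_CASE = full(r"[W-]46]", "-46]")

# A character type inside a class adds its characters to the class.
HEX_DIGIT = all(full(r"[\dABCDEF]", ch) for ch in "09AF")
NOT_HEX = full(r"[\dABCDEF]", "G")

# [^\W_] keeps word characters but removes the underscore.
LETTER_OR_DIGIT = full(r"[^\W_]", "x") and full(r"[^\W_]", "5")
UNDERSCORE = full(r"[^\W_]", "_")
```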

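All non-alphanumeric characters other than \, -, ^ (at the start) and the terminating ] are non-special in character classes, but it does no harm if they are escaped.

Vertical bar (|)

Vertical bar characters are used to separate alternative patterns. For example, the pattern
gilbert|sullivan
matches either "gilbert" or "sullivan". Any number of alternatives may appear, and an empty alternative is permitted (matching the empty string). The matching process tries each alternative in turn, from left to right, and the first one that succeeds is used. If the alternatives are within a subpattern (defined below), "succeeds" means matching the rest of the main pattern as well as the alternative in the subpattern.

Internal option setting

The settings of PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, PCRE_UNGREEDY, PCRE_EXTRA, and PCRE_EXTENDED can be changed from within the pattern by a sequence of Perl option letters enclosed between "(?" and ")". The option letters are: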

Table 2. Internal option letters

i    for PCRE_CASELESS
m    for PCRE_MULTILINE
s    for PCRE_DOTALL
x    for PCRE_EXTENDED
U    for PCRE_UNGREEDY
X    for PCRE_EXTRA

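For example, (?im) sets caseless, multiline matching. It is also possible to unset these options by preceding the letter with a hyphen, and a combined setting and unsetting such as (?im-sx), which sets PCRE_CASELESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED, is also permitted. If a letter appears both before and after the hyphen, the option is unset.

When an option change occurs at top level (that is, not inside subpattern parentheses), the change applies to the remainder of the pattern that follows. So /ab(?i)c/ matches only "abc" and "abC". This behaviour was changed in PCRE 4.0, which has been bundled since PHP 4.3.3. Before that version, /ab(?i)c/ performed like /abc/i (e.g. matching "ABC" and "aBc").

If an option change occurs inside a subpattern, the effect is different. This is a change of behaviour in Perl 5.005. An option change inside a subpattern affects only that part of the subpattern that follows it, so (a(?i)b)c matches "abc" and "aBc" and no other strings (assuming PCRE_CASELESS is not used). By this means, options can be made to have different settings in different parts of the pattern. Any changes made in one alternative do carry on into subsequent branches within the same subpattern. For example, (a(?i)b|c) matches "ab", "aB", "c", and "C", even though when matching "C" the first branch is abandoned before the option setting. This is because the effects of option settings happen at compile time. There would be some very weird behaviour otherwise.

The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can be changed in the same way as the Perl-compatible options by using the characters U and X respectively. The (?X) flag setting is special in that it must always occur earlier in the pattern than any of the additional features it turns on. It is best put at the start.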

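Subpatterns

Subpatterns are delimited by parentheses (round brackets), which can be nested. Marking part of a pattern as a subpattern does two things:

1. It localizes a set of alternatives. For example, the pattern
cat(aract|erpillar|)
matches one of the words "cat", "cataract", or "caterpillar". Without the parentheses, it would match "cataract", "erpillar" or the empty string.

2. It sets up the subpattern as a capturing subpattern (as defined above). When the whole pattern matches, that portion of the subject string that matched the subpattern is passed back to the caller via the ovector argument of pcre_exec(). Opening parentheses are counted from left to right (starting from 1) to obtain the numbers of the capturing subpatterns.

For example, if the string "the red king" is matched against the pattern
the ((red|white) (king|queen))
the captured substrings are "red king", "red", and "king", and are numbered 1, 2, and 3.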
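Both effects of parentheses can be seen in a short sketch. Python's re module is used as a stand-in for PCRE (an assumption; group numbering by opening parenthesis works the same way in both), and the variable names are illustrative.

```python
import re

# 1. Parentheses localize the alternatives to the part after "cat".
with_parens = re.compile(r"cat(aract|erpillar|)")
without_parens = re.compile(r"cataract|erpillar|")

LOCALIZED = [w for w in ("cat", "cataract", "caterpillar", "erpillar")
             if with_parens.fullmatch(w)]
UNLOCALIZED = [w for w in ("cat", "cataract", "erpillar", "")
               if without_parens.fullmatch(w)]

# 2. Capturing groups are numbered by their opening parenthesis, left to right.
match = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")
GROUPS = match.groups()
```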

The fact that plain parentheses fulfil two functions is not always helpful. There are often times when a grouping subpattern is required without a capturing requirement. If an opening parenthesis is followed by "?:", the subpattern does not do any capturing, and is not counted when computing the number of any subsequent capturing subpatterns. For example, if the string "the white queen" is matched against the pattern the ((?:red|white) (king|queen)) the captured substrings are "white queen" and "queen", and are numbered 1 and 2. The maximum number of captured substrings is 99, and the maximum number of all subpatterns, both capturing and non-capturing, is 200.

As a convenient shorthand, if any option settings are required at the start of a non-capturing subpattern, the option letters may appear between the "?" and the ":". Thus the two patterns


       (?i:saturday|sunday)
       (?:(?i)saturday|sunday)
    

match exactly the same set of strings. Because alternative branches are tried from left to right, and options are not reset until the end of the subpattern is reached, an option setting in one branch does affect subsequent branches, so the above patterns match "SUNDAY" as well as "Saturday".
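The (?i:...) shorthand can be tried with Python's re module as a stand-in for PCRE (an assumption; Python accepts the same syntax for option letters scoped to a group). Only the shorthand form is used, because recent Python versions reject an option change that is not at the very start of the pattern.

```python
import re

# The option letter "i" applies to both branches inside the group.
weekend = re.compile(r"(?i:saturday|sunday)")

def is_weekend(day):
    """Return True if day is 'saturday' or 'sunday' in any letter case."""
    return weekend.fullmatch(day) is not None
```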

Since PHP 4.3.3 a subpattern can be named with (?P<name>pattern). The array of matches then contains each such match twice: once indexed by the name and once indexed by its number.
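Named subpatterns can be sketched with Python's re module, which accepts the same (?P<name>pattern) syntax. It is only a stand-in for PCRE, and the way results are returned differs from PHP's match array; the group names year and month are made up for illustration.

```python
import re

match = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", "released 2004-08")

# The same captured text is reachable by name and by number.
YEAR_BY_NAME = match.group("year")
YEAR_BY_NUMBER = match.group(1)
ALL_NAMED = match.groupdict()
```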

Repetition

Repetition is specified by quantifiers, which can follow any of the following items:

  • a single character, possibly escaped

  • the . metacharacter

  • a character class

  • a back reference (see next section)

  • a parenthesized subpattern (unless it is an assertion - see below)

The general repetition quantifier specifies a minimum and maximum number of permitted matches, by giving the two numbers in curly brackets (braces), separated by a comma. The numbers must be less than 65536, and the first must be less than or equal to the second. For example: z{2,4} matches "zz", "zzz", or "zzzz". A closing brace on its own is not a special character. If the second number is omitted, but the comma is present, there is no upper limit; if the second number and the comma are both omitted, the quantifier specifies an exact number of required matches. Thus [aeiou]{3,} matches at least 3 successive vowels, but may match many more, while \d{8} matches exactly 8 digits. An opening curly bracket that appears in a position where a quantifier is not allowed, or one that does not match the syntax of a quantifier, is taken as a literal character. For example, {,6} is not a quantifier, but a literal string of four characters.
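The three quantifier forms described above can be checked with a small sketch. Python's re module is a stand-in for PCRE here (an assumption; it agrees on these forms, though not on every edge case of brace syntax), and full is a hypothetical helper.

```python
import re

def full(pattern, subject):
    """Return True if the whole subject matches the pattern."""
    return re.fullmatch(pattern, subject) is not None

# {min,max}: between two and four repetitions.
Z_RANGE = [s for s in ("z", "zz", "zzz", "zzzz", "zzzzz") if full(r"z{2,4}", s)]

# {min,}: no upper limit.
VOWELS_OPEN_ENDED = full(r"[aeiou]{3,}", "aeiouae")
TOO_FEW_VOWELS = full(r"[aeiou]{3,}", "ae")

# {n}: exactly n repetitions.
EIGHT_DIGITS = full(r"\d{8}", "12345678")
SEVEN_DIGITS = full(r"\d{8}", "1234567")
```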

The quantifier {0} is permitted, causing the expression to behave as if the previous item and the quantifier were not present.

For convenience (and historical compatibility) the three most common quantifiers have single-character abbreviations:

Table 3. Single-character quantifiers

*    equivalent to {0,}
+    equivalent to {1,}
?    equivalent to {0,1}

It is possible to construct infinite loops by following a subpattern that can match no characters with a quantifier that has no upper limit, for example: (a?)*

Earlier versions of Perl and PCRE used to give an error at compile time for such patterns. However, because there are cases where this can be useful, such patterns are now accepted, but if any repetition of the subpattern does in fact match no characters, the loop is forcibly broken.

By default, the quantifiers are "greedy", that is, they match as much as possible (up to the maximum number of permitted times), without causing the rest of the pattern to fail. The classic example of where this gives problems is in trying to match comments in C programs. These appear between the sequences /* and */ and within the sequence, individual * and / characters may appear. An attempt to match C comments by applying the pattern /\*.*\*/ to the string /* first comment */ not comment /* second comment */ fails, because it matches the entire string due to the greediness of the .* item.

However, if a quantifier is followed by a question mark, then it ceases to be greedy, and instead matches the minimum number of times possible, so the pattern /\*.*?\*/ does the right thing with the C comments. The meaning of the various quantifiers is not otherwise changed, just the preferred number of matches. Do not confuse this use of question mark with its use as a quantifier in its own right. Because it has two uses, it can sometimes appear doubled, as in \d??\d which matches one digit by preference, but can match two if that is the only way the rest of the pattern matches.
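The C comment example can be run as a sketch, with Python's re module standing in for PCRE (an assumption; greedy and minimizing quantifiers behave the same way in both engines).

```python
import re

source = "/* first comment */ not comment /* second comment */"

# Greedy: ".*" runs to the last "*/", swallowing the text between the comments.
GREEDY = re.findall(r"/\*.*\*/", source)

# Minimizing: ".*?" stops at the first "*/" after each "/*".
LAZY = re.findall(r"/\*.*?\*/", source)

# "\d??\d" prefers one digit, but takes two if the rest of the pattern needs it.
ONE_DIGIT = re.match(r"\d??\d", "12").group()
TWO_DIGITS = re.fullmatch(r"\d??\d", "12").group()
```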

If the PCRE_UNGREEDY option is set (an option which is not available in Perl) then the quantifiers are not greedy by default, but individual ones can be made greedy by following them with a question mark. In other words, it inverts the default behaviour.

Quantifiers followed by + are "possessive". They eat as many characters as possible and don't return to match the rest of the pattern. Thus .*abc matches "aabc" but .*+abc doesn't because .*+ eats the whole string. Possessive quantifiers can be used to speed up processing since PHP 4.3.3.

When a parenthesized subpattern is quantified with a minimum repeat count that is greater than 1 or with a limited maximum, more store is required for the compiled pattern, in proportion to the size of the minimum or maximum.

If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equivalent to Perl's /s) is set, thus allowing the . to match newlines, then the pattern is implicitly anchored, because whatever follows will be tried against every character position in the subject string, so there is no point in retrying the overall match at any position after the first. PCRE treats such a pattern as though it were preceded by \A. In cases where it is known that the subject string contains no newlines, it is worth setting PCRE_DOTALL when the pattern begins with .* in order to obtain this optimization, or alternatively using ^ to indicate anchoring explicitly.

When a capturing subpattern is repeated, the value captured is the substring that matched the final iteration. For example, after (tweedle[dume]{3}\s*)+ has matched "tweedledum tweedledee" the value of the captured substring is "tweedledee". However, if there are nested capturing subpatterns, the corresponding captured values may have been set in previous iterations. For example, after /(a|(b))+/ matches "aba" the value of the second captured substring is "b".
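Both capture examples can be checked with Python's re module as a stand-in for PCRE (an assumption; for these two patterns Python reports the same captured values as described above).

```python
import re

# The outer group is overwritten on every iteration, so the last one wins.
tweedle = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
LAST_ITERATION = tweedle.group(1)

# The nested group keeps the value set in an earlier iteration.
nested = re.fullmatch(r"(a|(b))+", "aba")
OUTER = nested.group(1)
INNER = nested.group(2)
```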

Back references

Outside a character class, a backslash followed by a digit greater than 0 (and possibly further digits) is a back reference to a capturing subpattern earlier (i.e. to its left) in the pattern, provided there have been that many previous capturing left parentheses.

However, if the decimal number following the backslash is less than 10, it is always taken as a back reference, and causes an error only if there are not that many capturing left parentheses in the entire pattern. In other words, the parentheses that are referenced need not be to the left of the reference for numbers less than 10. See the section entitled "Backslash" above for further details of the handling of digits following a backslash.

A back reference matches whatever actually matched the capturing subpattern in the current subject string, rather than anything matching the subpattern itself. So the pattern (sens|respons)e and \1ibility matches "sense and sensibility" and "response and responsibility", but not "sense and responsibility". If caseful matching is in force at the time of the back reference, then the case of letters is relevant. For example, ((?i)rah)\s+\1 matches "rah rah" and "RAH RAH", but not "RAH rah", even though the original capturing subpattern is matched caselessly.
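The first back reference example can be tried as follows, with Python's re module standing in for PCRE (an assumption; \1 has the same meaning in both). The helper name matches is illustrative.

```python
import re

pattern = re.compile(r"(sens|respons)e and \1ibility")

def matches(subject):
    """Return True if the whole subject matches the pattern."""
    return pattern.fullmatch(subject) is not None

SAME_WORD_1 = matches("sense and sensibility")
SAME_WORD_2 = matches("response and responsibility")
# \1 must repeat the text actually captured, not any text the group could match.
MIXED = matches("sense and responsibility")
```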

There may be more than one back reference to the same subpattern. If a subpattern has not actually been used in a particular match, then any back references to it always fail. For example, the pattern (a|(bc))\2 always fails if it starts to match "a" rather than "bc". Because there may be up to 99 back references, all digits following the backslash are taken as part of a potential back reference number. If the pattern continues with a digit character, then some delimiter must be used to terminate the back reference. If the PCRE_EXTENDED option is set, this can be whitespace. Otherwise an empty comment can be used.

A back reference that occurs inside the parentheses to which it refers fails when the subpattern is first used, so, for example, (a\1) never matches. However, such references can be useful inside repeated subpatterns. For example, the pattern (a|b\1)+ matches any number of "a"s and also "aba", "ababaa" etc. At each iteration of the subpattern, the back reference matches the character string corresponding to the previous iteration. In order for this to work, the pattern must be such that the first iteration does not need to match the back reference. This can be done using alternation, as in the example above, or by a quantifier with a minimum of zero.

Assertions

An assertion is a test on the characters following or preceding the current matching point that does not actually consume any characters. The simple assertions coded as \b, \B, \A, \Z, \z, ^ and $ are described above. More complicated assertions are coded as subpatterns. There are two kinds: those that look ahead of the current position in the subject string, and those that look behind it.

An assertion subpattern is matched in the normal way, except that it does not cause the current matching position to be changed. Lookahead assertions start with (?= for positive assertions and (?! for negative assertions. For example, \w+(?=;) matches a word followed by a semicolon, but does not include the semicolon in the match, and foo(?!bar) matches any occurrence of "foo" that is not followed by "bar". Note that the apparently similar pattern (?!foo)bar does not find an occurrence of "bar" that is preceded by something other than "foo"; it finds any occurrence of "bar" whatsoever, because the assertion (?!foo) is always TRUE when the next three characters are "bar". A lookbehind assertion is needed to achieve this effect.
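The three lookahead patterns above can be compared in a short sketch. Python's re module is a stand-in for PCRE (an assumption; lookahead syntax and semantics are the same in both).

```python
import re

# Positive lookahead: the semicolon is required but not part of the match.
WORD = re.search(r"\w+(?=;)", "foo;").group()

# Negative lookahead: only the "foo" that is not followed by "bar" matches.
FOO_STARTS = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo)bar still finds "bar" after "foo": at that point the next three
# characters are "bar", so the assertion that they are not "foo" holds.
BAR_START = re.search(r"(?!foo)bar", "foobar").start()
```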

Lookbehind assertions start with (?<= for positive assertions and (?<! for negative assertions. For example, (?<!foo)bar does find an occurrence of "bar" that is not preceded by "foo". The contents of a lookbehind assertion are restricted such that all the strings it matches must have a fixed length. However, if there are several alternatives, they do not all have to have the same fixed length. Thus (?<=bullock|donkey) is permitted, but (?<!dogs?|cats?) causes an error at compile time. Branches that match different length strings are permitted only at the top level of a lookbehind assertion. This is an extension compared with Perl 5.005, which requires all branches to match the same length of string. An assertion such as (?<=ab(c|de)) is not permitted, because its single top-level branch can match two different lengths, but it is acceptable if rewritten to use two top-level branches: (?<=abc|abde) The implementation of lookbehind assertions is, for each alternative, to temporarily move the current position back by the fixed width and then try to match. If there are insufficient characters before the current position, the match is deemed to fail. Lookbehinds in conjunction with once-only subpatterns can be particularly useful for matching at the ends of strings; an example is given at the end of the section on once-only subpatterns.

Several assertions (of any sort) may occur in succession. For example, (?<=\d{3})(?<!999)foo matches "foo" preceded by three digits that are not "999". Notice that each of the assertions is applied independently at the same point in the subject string. First there is a check that the previous three characters are all digits, then there is a check that the same three characters are not "999". This pattern does not match "foo" preceded by six characters, the first three of which are digits and the last three of which are not "999". For example, it doesn't match "123abcfoo". A pattern to do that is (?<=\d{3}...)(?<!999)foo

This time the first assertion looks at the preceding six characters, checking that the first three are digits, and then the second assertion checks that the preceding three characters are not "999".
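The two stacked-lookbehind patterns can be compared side by side. Python's re module stands in for PCRE (an assumption; both lookbehinds here have a single fixed width, which Python also accepts), and found is a hypothetical helper.

```python
import re

three_digits = re.compile(r"(?<=\d{3})(?<!999)foo")
six_chars = re.compile(r"(?<=\d{3}...)(?<!999)foo")

def found(pattern, subject):
    """Return True if the pattern matches anywhere in the subject."""
    return pattern.search(subject) is not None

# Both assertions of the first pattern look at the same three characters.
A = found(three_digits, "123foo")
B = found(three_digits, "999foo")
C = found(three_digits, "123abcfoo")

# The second pattern checks six characters back, then the last three.
D = found(six_chars, "123abcfoo")
E = found(six_chars, "123999foo")
```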

Assertions can be nested in any combination. For example, (?<=(?<!foo)bar)baz matches an occurrence of "baz" that is preceded by "bar" which in turn is not preceded by "foo", while (?<=\d{3}...(?<!999))foo is another pattern which matches "foo" preceded by three digits and any three characters that are not "999".

Assertion subpatterns are not capturing subpatterns, and may not be repeated, because it makes no sense to assert the same thing several times. If any kind of assertion contains capturing subpatterns within it, these are counted for the purposes of numbering the capturing subpatterns in the whole pattern. However, substring capturing is carried out only for positive assertions, because it does not make sense for negative assertions.

Assertions count towards the maximum of 200 parenthesized subpatterns.

Once-only subpatterns

With both maximizing and minimizing repetition, failure of what follows normally causes the repeated item to be re-evaluated to see if a different number of repeats allows the rest of the pattern to match. Sometimes it is useful to prevent this, either to change the nature of the match, or to cause it to fail earlier than it otherwise might, when the author of the pattern knows there is no point in carrying on.

Consider, for example, the pattern \d+foo when applied to the subject line 123456bar

After matching all 6 digits and then failing to match "foo", the normal action of the matcher is to try again with only 5 digits matching the \d+ item, and then with 4, and so on, before ultimately failing. Once-only subpatterns provide the means for specifying that once a portion of the pattern has matched, it is not to be re-evaluated in this way, so the matcher would give up immediately on failing to match "foo" the first time. The notation is another kind of special parenthesis, starting with (?> as in this example: (?>\d+)foo

This kind of parenthesis "locks up" the part of the pattern it contains once it has matched, and a failure further into the pattern is prevented from backtracking into it. Backtracking past it to previous items, however, works as normal.

An alternative description is that a subpattern of this type matches the string of characters that an identical standalone pattern would match, if anchored at the current point in the subject string.

Once-only subpatterns are not capturing subpatterns. Simple cases such as the above example can be thought of as a maximizing repeat that must swallow everything it can. So, while both \d+ and \d+? are prepared to adjust the number of digits they match in order to make the rest of the pattern match, (?>\d+) can only match an entire sequence of digits.

This construction can of course contain arbitrarily complicated subpatterns, and it can be nested.

Once-only subpatterns can be used in conjunction with look-behind assertions to specify efficient matching at the end of the subject string. Consider a simple pattern such as abcd$ when applied to a long string which does not match. Because matching proceeds from left to right, PCRE will look for each "a" in the subject and then see if what follows matches the rest of the pattern. If the pattern is specified as ^.*abcd$ then the initial .* matches the entire string at first, but when this fails (because there is no following "a"), it backtracks to match all but the last character, then all but the last two characters, and so on. Once again the search for "a" covers the entire string, from right to left, so we are no better off. However, if the pattern is written as ^(?>.*)(?<=abcd) then there can be no backtracking for the .* item; it can match only the entire string. The subsequent lookbehind assertion does a single test on the last four characters. If it fails, the match fails immediately. For long strings, this approach makes a significant difference to the processing time.

When a pattern contains an unlimited repeat inside a subpattern that can itself be repeated an unlimited number of times, the use of a once-only subpattern is the only way to avoid some failing matches taking a very long time indeed. The pattern (\D+|<\d+>)*[!?] matches an unlimited number of substrings that either consist of non-digits, or digits enclosed in <>, followed by either ! or ?. When it matches, it runs quickly. However, if it is applied to aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa it takes a long time before reporting failure. This is because the string can be divided between the two repeats in a large number of ways, and all have to be tried. (The example used [!?] rather than a single character at the end, because both PCRE and Perl have an optimization that allows for fast failure when a single character is used. They remember the last single character that is required for a match, and fail early if it is not present in the string.) If the pattern is changed to ((?>\D+)|<\d+>)*[!?] sequences of non-digits cannot be broken, and failure happens quickly.

Conditional subpatterns

It is possible to cause the matching process to obey a subpattern conditionally or to choose between two alternative subpatterns, depending on the result of an assertion, or whether a previous capturing subpattern matched or not. The two possible forms of conditional subpattern are


       (?(condition)yes-pattern)
       (?(condition)yes-pattern|no-pattern)
    

If the condition is satisfied, the yes-pattern is used; otherwise the no-pattern (if present) is used. If there are more than two alternatives in the subpattern, a compile-time error occurs.

There are two kinds of condition. If the text between the parentheses consists of a sequence of digits, then the condition is satisfied if the capturing subpattern of that number has previously matched. Consider the following pattern, which contains non-significant white space to make it more readable (assume the PCRE_EXTENDED option) and to divide it into three parts for ease of discussion: ( \( )? [^()]+ (?(1) \) )

The first part matches an optional opening parenthesis, and if that character is present, sets it as the first captured substring. The second part matches one or more characters that are not parentheses. The third part is a conditional subpattern that tests whether the first set of parentheses matched or not. If they did, that is, if subject started with an opening parenthesis, the condition is TRUE, and so the yes-pattern is executed and a closing parenthesis is required. Otherwise, since no-pattern is not present, the subpattern matches nothing. In other words, this pattern matches a sequence of non-parentheses, optionally enclosed in parentheses.
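The parenthesis example can be run with Python's re module as a stand-in for PCRE (an assumption; Python supports the same (?(1)...) condition on a group number, and re.VERBOSE plays the role of PCRE_EXTENDED). The helper name balanced is illustrative.

```python
import re

# Optional "(", then non-parentheses, then ")" only if group 1 matched.
pattern = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)

def balanced(subject):
    """Return True if subject is a run of non-parentheses, optionally enclosed."""
    return pattern.fullmatch(subject) is not None
```

Matching against the whole subject matters here: a search would still find the inner run of non-parentheses in a string such as "(abc".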

If the condition is the string (R), it is satisfied if a recursive call to the pattern or subpattern has been made. At "top level", the condition is false.

If the condition is not a sequence of digits or (R), it must be an assertion. This may be a positive or negative lookahead or lookbehind assertion. Consider this pattern, again containing non-significant white space, and with the two alternatives on the second line:


       (?(?=[^a-z]*[a-z])
       \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )
    

The condition is a positive lookahead assertion that matches an optional sequence of non-letters followed by a letter. In other words, it tests for the presence of at least one letter in the subject. If a letter is found, the subject is matched against the first alternative; otherwise it is matched against the second. This pattern matches strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.

Comments

The sequence (?# marks the start of a comment which continues up to the next closing parenthesis. Nested parentheses are not permitted. The characters that make up a comment play no part in the pattern matching at all.

If the PCRE_EXTENDED option is set, an unescaped # character outside a character class introduces a comment that continues up to the next newline character in the pattern.

Recursive patterns

Consider the problem of matching a string in parentheses, allowing for unlimited nested parentheses. Without the use of recursion, the best that can be done is to use a pattern that matches up to some fixed depth of nesting. It is not possible to handle an arbitrary nesting depth. Perl 5.6 has provided an experimental facility that allows regular expressions to recurse (among other things). The special item (?R) is provided for the specific case of recursion. This PCRE pattern solves the parentheses problem (assume the PCRE_EXTENDED option is set so that white space is ignored): \( ( (?>[^()]+) | (?R) )* \)

First it matches an opening parenthesis. Then it matches any number of substrings which can either be a sequence of non-parentheses, or a recursive match of the pattern itself (i.e. a correctly parenthesized substring). Finally there is a closing parenthesis.
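
The three steps above can be mirrored by a small hand-written recursive function. Python's re module has no (?R), so this is a sketch of what the pattern does rather than a use of it; the function name match_nested and its return convention are assumptions made for the illustration.

```python
def match_nested(subject: str, pos: int = 0) -> int:
    """Match one balanced parenthesized group starting exactly at pos.

    Returns the index just past the closing parenthesis, or -1 if there
    is no balanced group at that position.
    """
    # \(  : an opening parenthesis is required.
    if pos >= len(subject) or subject[pos] != "(":
        return -1
    pos += 1
    # ( [^()]+ | (?R) )*  : any number of plain characters or nested groups.
    while pos < len(subject):
        ch = subject[pos]
        if ch == ")":
            # \)  : the closing parenthesis ends this level.
            return pos + 1
        if ch == "(":
            # Recursive match of the whole pattern at this position.
            pos = match_nested(subject, pos)
            if pos == -1:
                return -1
        else:
            pos += 1
    return -1  # ran out of subject before the closing parenthesis


print(match_nested("(ab(cd)ef)"))
print(match_nested("(aaa()"))
```

No backtracking is needed here because the two alternatives start with different characters, which is also why making the run of non-parentheses a once-only subpattern loses nothing.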

This particular example pattern contains nested unlimited repeats, and so the use of a once-only subpattern for matching strings of non-parentheses is important when applying the pattern to strings that do not match. For example, when it is applied to (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() it yields "no match" quickly. However, if a once-only subpattern is not used, the match runs for a very long time indeed because there are so many different ways the + and * repeats can carve up the subject, and all have to be tested before failure can be reported.

The values set for any capturing subpatterns are those from the outermost level of the recursion at which the subpattern value is set. If the pattern above is matched against (ab(cd)ef) the value for the capturing parentheses is "ef", which is the last value taken on at the top level. If additional parentheses are added, giving \( ( ( (?>[^()]+) | (?R) )* ) \) then the string they capture is "ab(cd)ef", the contents of the top level parentheses.

If there are more than 15 capturing parentheses in a pattern, PCRE has to obtain extra memory to store data during a recursion, which it does by using pcre_malloc, freeing it via pcre_free afterwards. If no memory can be obtained, it saves data for the first 15 capturing parentheses only, as there is no way to give an out-of-memory error from within a recursion.

Since PHP 4.3.3, (?1), (?2) and so on can be used for recursive subpatterns too. It is also possible to use named subpatterns: (?P>foo).

If the syntax for a recursive subpattern reference (either by number or by name) is used outside the parentheses to which it refers, it operates like a subroutine in a programming language. An earlier example pointed out that the pattern (sens|respons)e and \1ibility matches "sense and sensibility" and "response and responsibility", but not "sense and responsibility". If instead the pattern (sens|respons)e and (?1)ibility is used, it does match "sense and responsibility" as well as the other two strings. Such references must, however, follow the subpattern to which they refer.
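
The difference between a back reference and a subroutine-style reference can be approximated in Python. The re module has no (?1), so the sketch repeats the subpattern's text instead, which behaves the same way for this example; STEM, BACKREF and REUSED are hypothetical names.

```python
import re

# Text of the subpattern that (?1) would re-run (non-capturing here).
STEM = r"(?:sens|respons)"

# \1 must repeat the exact text that group 1 captured.
BACKREF = re.compile(r"(sens|respons)e and \1ibility")

# Re-running the subpattern lets the second occurrence choose freely.
REUSED = re.compile(r"(sens|respons)e and " + STEM + r"ibility")

for subject in [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]:
    print(subject, bool(BACKREF.fullmatch(subject)), bool(REUSED.fullmatch(subject)))
```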

Performance

Certain items that may appear in patterns are more efficient than others. It is more efficient to use a character class like [aeiou] than a set of alternatives such as (a|e|i|o|u). In general, the simplest construction that provides the required behaviour is usually the most efficient. Jeffrey Friedl's book contains a lot of discussion about optimizing regular expressions for efficient performance.

When a pattern begins with .* and the PCRE_DOTALL option is set, the pattern is implicitly anchored by PCRE, since it can match only at the start of a subject string. However, if PCRE_DOTALL is not set, PCRE cannot make this optimization, because the . metacharacter does not then match a newline, and if the subject string contains newlines, the pattern may match from the character immediately following one of them instead of from the very start. For example, the pattern (.*) second matches the subject "first\nand second" (where \n stands for a newline character) with the first captured substring being "and". In order to do this, PCRE has to retry the match starting after every newline in the subject.

If you are using such a pattern with subject strings that do not contain newlines, the best performance is obtained by setting PCRE_DOTALL, or starting the pattern with ^.* to indicate explicit anchoring. That saves PCRE from having to scan along the subject looking for a newline to restart at.
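
The matching behaviour in the example above (not the internal optimization) can be reproduced with Python's re module, where re.DOTALL corresponds to PCRE_DOTALL. This is a sketch on a different engine, shown only to make the captured substrings concrete.

```python
import re

subject = "first\nand second"

# Without DOTALL, "." stops at the newline, so the match has to start
# after it and the capture is "and".
without_dotall = re.search(r"(.*) second", subject)

# With DOTALL, ".*" can cross the newline and the match starts at the
# very beginning of the subject.
with_dotall = re.search(r"(.*) second", subject, re.DOTALL)

# Explicit anchoring forbids a restart after the newline, so this subject
# (which does contain a newline) no longer matches.
anchored = re.search(r"^(.*) second", subject)

print(repr(without_dotall.group(1)))
print(repr(with_dotall.group(1)))
print(anchored)
```

The last case shows why explicit anchoring is suggested only for subjects known to contain no newlines.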

Beware of patterns that contain nested indefinite repeats. These can take a long time to run when applied to a string that does not match. Consider the pattern fragment (a+)*

This can match "aaaa" in 16 different ways, and this number increases very rapidly as the string gets longer. (The * repeat can match 0, 1, 2, 3, or 4 times, and for each of those cases other than 0 or 4, the + repeats can match different numbers of times.) When the remainder of the pattern is such that the entire match is going to fail, PCRE has in principle to try every possible variation, and this can take an extremely long time.
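
One way to check how fast the number of variations grows is to enumerate them. The sketch below counts every way (a+)* can consume a prefix of a run of n "a" characters, treating each distinct split into non-empty pieces as one way; that counting convention and the function name count_ways are assumptions made for this illustration.

```python
def count_ways(n: int) -> int:
    """Count the ways (a+)* can match some prefix of n 'a' characters."""
    # One way is to stop repeating here (the * takes no further iteration).
    total = 1
    # Otherwise the next a+ takes 1..n characters and the rest is
    # matched in the same fashion.
    for take in range(1, n + 1):
        total += count_ways(n - take)
    return total


for n in (4, 10, 20):
    print(n, count_ways(n))
```

Under this convention the count doubles with every extra character (2 to the power n), which is why a failing match against even a short line can take so long.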

An optimization catches some of the simpler cases such as (a+)*b where a literal character follows. Before embarking on the standard matching procedure, PCRE checks that there is a "b" later in the subject string, and if there is not, it fails the match immediately. However, when there is no following literal this optimization cannot be used. You can see the difference by comparing the behaviour of (a+)*\d with that of (a+)*b. When applied to a whole line of "a" characters, (a+)*b fails almost instantly, whereas (a+)*\d takes an appreciable time with strings longer than about 20 characters.