Start

author: Philip Hazel <ph10@hermes.cam.ac.uk> 2004-10-07 15:04:35 +0000
committer: Philip Hazel <ph10@hermes.cam.ac.uk> 2004-10-07 15:04:35 +0000
commit: 495ae4b01f36d0d8bb0e34a1d7263c2b8224aa4a (patch)
tree: fcfaa2c623d4f155eef907b50b950b602829a30b /doc/doc-txt/pcrepattern.txt
parent: 0756eb3cb50d73a77b486e47528f7cb1bffdb299 (diff)
1 files changed, 1413 insertions, 0 deletions
diff --git a/doc/doc-txt/pcrepattern.txt b/doc/doc-txt/pcrepattern.txt
new file mode 100644
index 000000000..1dc800af4
--- /dev/null
+++ b/doc/doc-txt/pcrepattern.txt
@@ -0,0 +1,1413 @@
+This file contains the PCRE man page that describes the regular expressions 
+supported by PCRE version 5.0. Note that not all of the features are relevant 
+in the context of Exim. In particular, the version of PCRE that is compiled
+with Exim does not include UTF-8 support, there is no mechanism for changing
+the options with which the PCRE functions are called, and features such as
+callout are not accessible.
+-----------------------------------------------------------------------------
+
+PCRE(3)                                                                PCRE(3)
+
+
+
+NAME
+       PCRE - Perl-compatible regular expressions
+
+PCRE REGULAR EXPRESSION DETAILS
+
+       The  syntax  and semantics of the regular expressions supported by PCRE
+       are described below. Regular expressions are also described in the Perl
+       documentation  and  in  a  number  of books, some of which have copious
+       examples.  Jeffrey Friedl's "Mastering Regular Expressions",  published
+       by  O'Reilly, covers regular expressions in great detail. This descrip-
+       tion of PCRE's regular expressions is intended as reference material.
+
+       The original operation of PCRE was on strings of  one-byte  characters.
+       However,  there is now also support for UTF-8 character strings. To use
+       this, you must build PCRE to  include  UTF-8  support,  and  then  call
+       pcre_compile()  with  the  PCRE_UTF8  option.  How this affects pattern
+       matching is mentioned in several places below. There is also a  summary
+       of  UTF-8  features  in  the  section on UTF-8 support in the main pcre
+       page.
+
+       A regular expression is a pattern that is  matched  against  a  subject
+       string  from  left  to right. Most characters stand for themselves in a
+       pattern, and match the corresponding characters in the  subject.  As  a
+       trivial example, the pattern
+
+         The quick brown fox
+
+       matches  a portion of a subject string that is identical to itself. The
+       power of regular expressions comes from the ability to include alterna-
+       tives  and repetitions in the pattern. These are encoded in the pattern
+       by the use of metacharacters, which do not  stand  for  themselves  but
+       instead are interpreted in some special way.
+
+       There  are  two different sets of metacharacters: those that are recog-
+       nized anywhere in the pattern except within square brackets, and  those
+       that  are  recognized  in square brackets. Outside square brackets, the
+       metacharacters are as follows:
+
+         \      general escape character with several uses
+         ^      assert start of string (or line, in multiline mode)
+         $      assert end of string (or line, in multiline mode)
+         .      match any character except newline (by default)
+         [      start character class definition
+         |      start of alternative branch
+         (      start subpattern
+         )      end subpattern
+         ?      extends the meaning of (
+                also 0 or 1 quantifier
+                also quantifier minimizer
+         *      0 or more quantifier
+         +      1 or more quantifier
+                also "possessive quantifier"
+         {      start min/max quantifier
+
+       Part of a pattern that is in square brackets  is  called  a  "character
+       class". In a character class the only metacharacters are:
+
+         \      general escape character
+         ^      negate the class, but only if the first character
+         -      indicates character range
+         [      POSIX character class (only if followed by POSIX
+                  syntax)
+         ]      terminates the character class
+
+       The  following sections describe the use of each of the metacharacters.
+
+
+BACKSLASH
+
+       The backslash character has several uses. Firstly, if it is followed by
+       a  non-alphanumeric  character,  it takes away any special meaning that
+       character may have. This  use  of  backslash  as  an  escape  character
+       applies both inside and outside character classes.
+
+       For  example,  if  you want to match a * character, you write \* in the
+       pattern.  This escaping action applies whether  or  not  the  following
+       character  would  otherwise be interpreted as a metacharacter, so it is
+       always safe to precede a non-alphanumeric  with  backslash  to  specify
+       that  it stands for itself. In particular, if you want to match a back-
+       slash, you write \\.
+
+       If a pattern is compiled with the PCRE_EXTENDED option,  whitespace  in
+       the  pattern (other than in a character class) and characters between a
+       # outside a character class and the next newline character are ignored.
+       An  escaping backslash can be used to include a whitespace or # charac-
+       ter as part of the pattern.
+
+       If you want to remove the special meaning from a  sequence  of  charac-
+       ters,  you can do so by putting them between \Q and \E. This is differ-
+       ent from Perl in that $ and  @  are  handled  as  literals  in  \Q...\E
+       sequences  in  PCRE, whereas in Perl, $ and @ cause variable interpola-
+       tion. Note the following examples:
+
+         Pattern            PCRE matches   Perl matches
+
+         \Qabc$xyz\E        abc$xyz        abc followed by the
+                                             contents of $xyz
+         \Qabc\$xyz\E       abc\$xyz       abc\$xyz
+         \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
+
+       The \Q...\E sequence is recognized both inside  and  outside  character
+       classes.
+
+   Non-printing characters
+
+       A second use of backslash provides a way of encoding non-printing char-
+       acters in patterns in a visible manner. There is no restriction on  the
+       appearance  of non-printing characters, apart from the binary zero that
+       terminates a pattern, but when a pattern  is  being  prepared  by  text
+       editing,  it  is  usually  easier  to  use  one of the following escape
+       sequences than the binary character it represents:
+
+         \a        alarm, that is, the BEL character (hex 07)
+         \cx       "control-x", where x is any character
+         \e        escape (hex 1B)
+         \f        formfeed (hex 0C)
+         \n        newline (hex 0A)
+         \r        carriage return (hex 0D)
+         \t        tab (hex 09)
+         \ddd      character with octal code ddd, or backreference
+         \xhh      character with hex code hh
+         \x{hhh..} character with hex code hhh... (UTF-8 mode only)
+
+       The precise effect of \cx is as follows: if x is a lower  case  letter,
+       it  is converted to upper case. Then bit 6 of the character (hex 40) is
+       inverted.  Thus \cz becomes hex 1A, but \c{ becomes hex 3B,  while  \c;
+       becomes hex 7B.
+
+       After  \x, from zero to two hexadecimal digits are read (letters can be
+       in upper or lower case). In UTF-8 mode, any number of hexadecimal  dig-
+       its  may  appear between \x{ and }, but the value of the character code
+       must be less than 2**31 (that is,  the  maximum  hexadecimal  value  is
+       7FFFFFFF).  If  characters other than hexadecimal digits appear between
+       \x{ and }, or if there is no terminating }, this form of escape is  not
+       recognized. Instead, the initial \x will be interpreted as a basic hex-
+       adecimal escape, with no following digits,  giving  a  character  whose
+       value is zero.
+
+       Characters whose value is less than 256 can be defined by either of the
+       two syntaxes for \x when PCRE is in UTF-8 mode. There is no  difference
+       in  the  way they are handled. For example, \xdc is exactly the same as
+       \x{dc}.
+
+       After \0 up to two further octal digits are read.  In  both  cases,  if
+       there  are fewer than two digits, just those that are present are used.
+       Thus the sequence \0\x\07 specifies two binary zeros followed by a  BEL
+       character  (code  value  7).  Make sure you supply two digits after the
+       initial zero if the pattern character that follows is itself  an  octal
+       digit.
+
+       The handling of a backslash followed by a digit other than 0 is compli-
+       cated.  Outside a character class, PCRE reads it and any following dig-
+       its  as  a  decimal  number. If the number is less than 10, or if there
+       have been at least that many previous capturing left parentheses in the
+       expression,  the  entire  sequence  is  taken  as  a  back reference. A
+       description of how this works is given later, following the  discussion
+       of parenthesized subpatterns.
+
+       Inside  a  character  class, or if the decimal number is greater than 9
+       and there have not been that many capturing subpatterns, PCRE  re-reads
+       up  to three octal digits following the backslash, and generates a sin-
+       gle byte from the least significant 8 bits of the value. Any subsequent
+       digits stand for themselves.  For example:
+
+         \040   is another way of writing a space
+         \40    is the same, provided there are fewer than 40
+                   previous capturing subpatterns
+         \7     is always a back reference
+         \11    might be a back reference, or another way of
+                   writing a tab
+         \011   is always a tab
+         \0113  is a tab followed by the character "3"
+         \113   might be a back reference, otherwise the
+                   character with octal code 113
+         \377   might be a back reference, otherwise
+                   the byte consisting entirely of 1 bits
+         \81    is either a back reference, or a binary zero
+                   followed by the two characters "8" and "1"
+
+       Note  that  octal  values of 100 or greater must not be introduced by a
+       leading zero, because no more than three octal digits are ever read.
+
+       All the sequences that define a single byte value  or  a  single  UTF-8
+       character (in UTF-8 mode) can be used both inside and outside character
+       classes. In addition, inside a character  class,  the  sequence  \b  is
+       interpreted as the backspace character (hex 08), and the sequence \X is
+       interpreted as the character "X".  Outside  a  character  class,  these
+       sequences have different meanings (see below).
+
+   Generic character types
+
+       The  third  use of backslash is for specifying generic character types.
+       The following are always recognized:
+
+         \d     any decimal digit
+         \D     any character that is not a decimal digit
+         \s     any whitespace character
+         \S     any character that is not a whitespace character
+         \w     any "word" character
+         \W     any "non-word" character
+
+       Each pair of escape sequences partitions the complete set of characters
+       into  two disjoint sets. Any given character matches one, and only one,
+       of each pair.
+
+       These character type sequences can appear both inside and outside char-
+       acter  classes.  They each match one character of the appropriate type.
+       If the current matching point is at the end of the subject string,  all
+       of them fail, since there is no character to match.
+
+       For  compatibility  with Perl, \s does not match the VT character (code
+       11).  This makes it different from the the POSIX "space" class. The  \s
+       characters are HT (9), LF (10), FF (12), CR (13), and space (32).
+
+       A "word" character is an underscore or any character less than 256 that
+       is a letter or digit. The definition of  letters  and  digits  is  con-
+       trolled  by PCRE's low-valued character tables, and may vary if locale-
+       specific matching is taking place (see "Locale support" in the  pcreapi
+       page).  For  example,  in  the  "fr_FR" (French) locale, some character
+       codes greater than 128 are used for accented  letters,  and  these  are
+       matched by \w.
+
+       In  UTF-8 mode, characters with values greater than 128 never match \d,
+       \s, or \w, and always match \D, \S, and \W. This is true even when Uni-
+       code character property support is available.
+
+   Unicode character properties
+
+       When PCRE is built with Unicode character property support, three addi-
+       tional escape sequences to match generic character types are  available
+       when UTF-8 mode is selected. They are:
+
+        \p{xx}   a character with the xx property
+        \P{xx}   a character without the xx property
+        \X       an extended Unicode sequence
+
+       The  property  names represented by xx above are limited to the Unicode
+       general category properties. Each character has exactly one such  prop-
+       erty,  specified  by  a two-letter abbreviation. For compatibility with
+       Perl, negation can be specified by including a circumflex  between  the
+       opening  brace  and the property name. For example, \p{^Lu} is the same
+       as \P{Lu}.
+
+       If only one letter is specified with \p or  \P,  it  includes  all  the
+       properties that start with that letter. In this case, in the absence of
+       negation, the curly brackets in the escape sequence are optional; these
+       two examples have the same effect:
+
+         \p{L}
+         \pL
+
+       The following property codes are supported:
+
+         C     Other
+         Cc    Control
+         Cf    Format
+         Cn    Unassigned
+         Co    Private use
+         Cs    Surrogate
+
+         L     Letter
+         Ll    Lower case letter
+         Lm    Modifier letter
+         Lo    Other letter
+         Lt    Title case letter
+         Lu    Upper case letter
+
+         M     Mark
+         Mc    Spacing mark
+         Me    Enclosing mark
+         Mn    Non-spacing mark
+
+         N     Number
+         Nd    Decimal number
+         Nl    Letter number
+         No    Other number
+
+         P     Punctuation
+         Pc    Connector punctuation
+         Pd    Dash punctuation
+         Pe    Close punctuation
+         Pf    Final punctuation
+         Pi    Initial punctuation
+         Po    Other punctuation
+         Ps    Open punctuation
+
+         S     Symbol
+         Sc    Currency symbol
+         Sk    Modifier symbol
+         Sm    Mathematical symbol
+         So    Other symbol
+
+         Z     Separator
+         Zl    Line separator
+         Zp    Paragraph separator
+         Zs    Space separator
+
+       Extended  properties such as "Greek" or "InMusicalSymbols" are not sup-
+       ported by PCRE.
+
+       Specifying caseless matching does not affect  these  escape  sequences.
+       For example, \p{Lu} always matches only upper case letters.
+
+       The  \X  escape  matches  any number of Unicode characters that form an
+       extended Unicode sequence. \X is equivalent to
+
+         (?>\PM\pM*)
+
+       That is, it matches a character without the "mark"  property,  followed
+       by  zero  or  more  characters with the "mark" property, and treats the
+       sequence as an atomic group (see below).  Characters  with  the  "mark"
+       property are typically accents that affect the preceding character.
+
+       Matching  characters  by Unicode property is not fast, because PCRE has
+       to search a structure that contains  data  for  over  fifteen  thousand
+       characters. That is why the traditional escape sequences such as \d and
+       \w do not use Unicode properties in PCRE.
+
+   Simple assertions
+
+       The fourth use of backslash is for certain simple assertions. An asser-
+       tion  specifies a condition that has to be met at a particular point in
+       a match, without consuming any characters from the subject string.  The
+       use  of subpatterns for more complicated assertions is described below.
+       The backslashed assertions are:
+
+         \b     matches at a word boundary
+         \B     matches when not at a word boundary
+         \A     matches at start of subject
+         \Z     matches at end of subject or before newline at end
+         \z     matches at end of subject
+         \G     matches at first matching position in subject
+
+       These assertions may not appear in character classes (but note that  \b
+       has a different meaning, namely the backspace character, inside a char-
+       acter class).
+
+       A word boundary is a position in the subject string where  the  current
+       character  and  the previous character do not both match \w or \W (i.e.
+       one matches \w and the other matches \W), or the start or  end  of  the
+       string if the first or last character matches \w, respectively.
+
+       The  \A,  \Z,  and \z assertions differ from the traditional circumflex
+       and dollar (described in the next section) in that they only ever match
+       at  the  very start and end of the subject string, whatever options are
+       set. Thus, they are independent of multiline mode. These  three  asser-
+       tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which
+       affect only the behaviour of the circumflex and dollar  metacharacters.
+       However,  if the startoffset argument of pcre_exec() is non-zero, indi-
+       cating that matching is to start at a point other than the beginning of
+       the  subject,  \A  can never match. The difference between \Z and \z is
+       that \Z matches before a newline that is  the  last  character  of  the
+       string  as well as at the end of the string, whereas \z matches only at
+       the end.
+
+       The \G assertion is true only when the current matching position is  at
+       the  start point of the match, as specified by the startoffset argument
+       of pcre_exec(). It differs from \A when the  value  of  startoffset  is
+       non-zero.  By calling pcre_exec() multiple times with appropriate argu-
+       ments, you can mimic Perl's /g option, and it is in this kind of imple-
+       mentation where \G can be useful.
+
+       Note,  however,  that  PCRE's interpretation of \G, as the start of the
+       current match, is subtly different from Perl's, which defines it as the
+       end  of  the  previous  match. In Perl, these can be different when the
+       previously matched string was empty. Because PCRE does just  one  match
+       at a time, it cannot reproduce this behaviour.
+
+       If  all  the alternatives of a pattern begin with \G, the expression is
+       anchored to the starting match position, and the "anchored" flag is set
+       in the compiled regular expression.
+
+
+CIRCUMFLEX AND DOLLAR
+
+       Outside a character class, in the default matching mode, the circumflex
+       character is an assertion that is true only  if  the  current  matching
+       point  is  at the start of the subject string. If the startoffset argu-
+       ment of pcre_exec() is non-zero, circumflex  can  never  match  if  the
+       PCRE_MULTILINE  option  is  unset. Inside a character class, circumflex
+       has an entirely different meaning (see below).
+
+       Circumflex need not be the first character of the pattern if  a  number
+       of  alternatives are involved, but it should be the first thing in each
+       alternative in which it appears if the pattern is ever  to  match  that
+       branch.  If all possible alternatives start with a circumflex, that is,
+       if the pattern is constrained to match only at the start  of  the  sub-
+       ject,  it  is  said  to be an "anchored" pattern. (There are also other
+       constructs that can cause a pattern to be anchored.)
+
+       A dollar character is an assertion that is true  only  if  the  current
+       matching  point  is  at  the  end of the subject string, or immediately
+       before a newline character that is the last character in the string (by
+       default).  Dollar  need  not  be the last character of the pattern if a
+       number of alternatives are involved, but it should be the last item  in
+       any  branch  in  which  it appears.  Dollar has no special meaning in a
+       character class.
+
+       The meaning of dollar can be changed so that it  matches  only  at  the
+       very  end  of  the string, by setting the PCRE_DOLLAR_ENDONLY option at
+       compile time. This does not affect the \Z assertion.
+
+       The meanings of the circumflex and dollar characters are changed if the
+       PCRE_MULTILINE option is set. When this is the case, they match immedi-
+       ately after and  immediately  before  an  internal  newline  character,
+       respectively,  in addition to matching at the start and end of the sub-
+       ject string. For example,  the  pattern  /^abc$/  matches  the  subject
+       string  "def\nabc"  (where \n represents a newline character) in multi-
+       line mode, but not otherwise.  Consequently, patterns that are anchored
+       in  single line mode because all branches start with ^ are not anchored
+       in multiline mode, and a match for  circumflex  is  possible  when  the
+       startoffset   argument   of  pcre_exec()  is  non-zero.  The  PCRE_DOL-
+       LAR_ENDONLY option is ignored if PCRE_MULTILINE is set.
+
+       Note that the sequences \A, \Z, and \z can be used to match  the  start
+       and  end of the subject in both modes, and if all branches of a pattern
+       start with \A it is always anchored, whether PCRE_MULTILINE is  set  or
+       not.
+
+
+FULL STOP (PERIOD, DOT)
+
+       Outside a character class, a dot in the pattern matches any one charac-
+       ter in the subject, including a non-printing  character,  but  not  (by
+       default)  newline.   In  UTF-8 mode, a dot matches any UTF-8 character,
+       which might be more than one byte long, except (by default) newline. If
+       the  PCRE_DOTALL  option  is set, dots match newlines as well. The han-
+       dling of dot is entirely independent of the handling of circumflex  and
+       dollar,  the  only  relationship  being  that they both involve newline
+       characters. Dot has no special meaning in a character class.
+
+
+MATCHING A SINGLE BYTE
+
+       Outside a character class, the escape sequence \C matches any one byte,
+       both  in  and  out of UTF-8 mode. Unlike a dot, it can match a newline.
+       The feature is provided in Perl in order to match individual  bytes  in
+       UTF-8  mode.  Because  it  breaks  up  UTF-8 characters into individual
+       bytes, what remains in the string may be a malformed UTF-8 string.  For
+       this reason, the \C escape sequence is best avoided.
+
+       PCRE  does  not  allow \C to appear in lookbehind assertions (described
+       below), because in UTF-8 mode this would make it impossible  to  calcu-
+       late the length of the lookbehind.
+
+
+SQUARE BRACKETS AND CHARACTER CLASSES
+
+       An opening square bracket introduces a character class, terminated by a
+       closing square bracket. A closing square bracket on its own is not spe-
+       cial. If a closing square bracket is required as a member of the class,
+       it should be the first data character in the class  (after  an  initial
+       circumflex, if present) or escaped with a backslash.
+
+       A  character  class matches a single character in the subject. In UTF-8
+       mode, the character may occupy more than one byte. A matched  character
+       must be in the set of characters defined by the class, unless the first
+       character in the class definition is a circumflex, in  which  case  the
+       subject  character  must  not  be in the set defined by the class. If a
+       circumflex is actually required as a member of the class, ensure it  is
+       not the first character, or escape it with a backslash.
+
+       For  example, the character class [aeiou] matches any lower case vowel,
+       while [^aeiou] matches any character that is not a  lower  case  vowel.
+       Note that a circumflex is just a convenient notation for specifying the
+       characters that are in the class by enumerating those that are  not.  A
+       class  that starts with a circumflex is not an assertion: it still con-
+       sumes a character from the subject string, and therefore  it  fails  if
+       the current pointer is at the end of the string.
+
+       In  UTF-8 mode, characters with values greater than 255 can be included
+       in a class as a literal string of bytes, or by using the  \x{  escaping
+       mechanism.
+
+       When  caseless  matching  is set, any letters in a class represent both
+       their upper case and lower case versions, so for  example,  a  caseless
+       [aeiou]  matches  "A"  as well as "a", and a caseless [^aeiou] does not
+       match "A", whereas a caseful version would. When running in UTF-8 mode,
+       PCRE  supports  the  concept of case for characters with values greater
+       than 128 only when it is compiled with Unicode property support.
+
+       The newline character is never treated in any special way in  character
+       classes,  whatever  the  setting  of  the PCRE_DOTALL or PCRE_MULTILINE
+       options is. A class such as [^a] will always match a newline.
+
+       The minus (hyphen) character can be used to specify a range of  charac-
+       ters  in  a  character  class.  For  example,  [d-m] matches any letter
+       between d and m, inclusive. If a  minus  character  is  required  in  a
+       class,  it  must  be  escaped  with a backslash or appear in a position
+       where it cannot be interpreted as indicating a range, typically as  the
+       first or last character in the class.
+
+       It is not possible to have the literal character "]" as the end charac-
+       ter of a range. A pattern such as [W-]46] is interpreted as a class  of
+       two  characters ("W" and "-") followed by a literal string "46]", so it
+       would match "W46]" or "-46]". However, if the "]"  is  escaped  with  a
+       backslash  it is interpreted as the end of range, so [W-\]46] is inter-
+       preted as a class containing a range followed by two other  characters.
+       The  octal or hexadecimal representation of "]" can also be used to end
+       a range.
+
+       Ranges operate in the collating sequence of character values. They  can
+       also   be  used  for  characters  specified  numerically,  for  example
+       [\000-\037]. In UTF-8 mode, ranges can include characters whose  values
+       are greater than 255, for example [\x{100}-\x{2ff}].
+
+       If a range that includes letters is used when caseless matching is set,
+       it matches the letters in either case. For example, [W-c] is equivalent
+       to  [][\\^_`wxyzabc],  matched  caselessly,  and  in non-UTF-8 mode, if
+       character tables for the "fr_FR" locale are in use, [\xc8-\xcb] matches
+       accented  E  characters in both cases. In UTF-8 mode, PCRE supports the
+       concept of case for characters with values greater than 128  only  when
+       it is compiled with Unicode property support.
+
+       The  character types \d, \D, \p, \P, \s, \S, \w, and \W may also appear
+       in a character class, and add the characters that  they  match  to  the
+       class. For example, [\dABCDEF] matches any hexadecimal digit. A circum-
+       flex can conveniently be used with the upper case  character  types  to
+       specify  a  more  restricted  set of characters than the matching lower
+       case type. For example, the class [^\W_] matches any letter  or  digit,
+       but not underscore.
+
+       The  only  metacharacters  that are recognized in character classes are
+       backslash, hyphen (only where it can be  interpreted  as  specifying  a
+       range),  circumflex  (only  at the start), opening square bracket (only
+       when it can be interpreted as introducing a POSIX class name - see  the
+       next  section),  and  the  terminating closing square bracket. However,
+       escaping other non-alphanumeric characters does no harm.
+
+
+POSIX CHARACTER CLASSES
+
+       Perl supports the POSIX notation for character classes. This uses names
+       enclosed  by  [: and :] within the enclosing square brackets. PCRE also
+       supports this notation. For example,
+
+         [01[:alpha:]%]
+
+       matches "0", "1", any alphabetic character, or "%". The supported class
+       names are
+
+         alnum    letters and digits
+         alpha    letters
+         ascii    character codes 0 - 127
+         blank    space or tab only
+         cntrl    control characters
+         digit    decimal digits (same as \d)
+         graph    printing characters, excluding space
+         lower    lower case letters
+         print    printing characters, including space
+         punct    printing characters, excluding letters and digits
+         space    white space (not quite the same as \s)
+         upper    upper case letters
+         word     "word" characters (same as \w)
+         xdigit   hexadecimal digits
+
+       The  "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
+       and space (32). Notice that this list includes the VT  character  (code
+       11). This makes "space" different to \s, which does not include VT (for
+       Perl compatibility).
+
+       The name "word" is a Perl extension, and "blank"  is  a  GNU  extension
+       from  Perl  5.8. Another Perl extension is negation, which is indicated
+       by a ^ character after the colon. For example,
+
+         [12[:^digit:]]
+
+       matches "1", "2", or any non-digit. PCRE (and Perl) also recognize  the
+       POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
+       these are not supported, and an error is given if they are encountered.
+
+       In UTF-8 mode, characters with values greater than 128 do not match any
+       of the POSIX character classes.
+
+
+VERTICAL BAR
+
+       Vertical bar characters are used to separate alternative patterns.  For
+       example, the pattern
+
+         gilbert|sullivan
+
+       matches  either "gilbert" or "sullivan". Any number of alternatives may
+       appear, and an empty  alternative  is  permitted  (matching  the  empty
+       string).   The  matching  process  tries each alternative in turn, from
+       left to right, and the first one that succeeds is used. If the alterna-
+       tives  are within a subpattern (defined below), "succeeds" means match-
+       ing the rest of the main pattern as well as the alternative in the sub-
+       pattern.
+
+
+INTERNAL OPTION SETTING
+
+       The  settings  of  the  PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
+       PCRE_EXTENDED options can be changed  from  within  the  pattern  by  a
+       sequence  of  Perl  option  letters  enclosed between "(?" and ")". The
+       option letters are
+
+         i  for PCRE_CASELESS
+         m  for PCRE_MULTILINE
+         s  for PCRE_DOTALL
+         x  for PCRE_EXTENDED
+
+       For example, (?im) sets caseless, multiline matching. It is also possi-
+       ble to unset these options by preceding the letter with a hyphen, and a
+       combined setting and unsetting such as (?im-sx), which sets  PCRE_CASE-
+       LESS  and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED,
+       is also permitted. If a  letter  appears  both  before  and  after  the
+       hyphen, the option is unset.
+
+       When  an option change occurs at top level (that is, not inside subpat-
+       tern parentheses), the change applies to the remainder of  the  pattern
+       that follows.  If the change is placed right at the start of a pattern,
+       PCRE extracts it into the global options (and it will therefore show up
+       in data extracted by the pcre_fullinfo() function).
+
+       An option change within a subpattern affects only that part of the cur-
+       rent pattern that follows it, so
+
+         (a(?i)b)c
+
+       matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
+       used).   By  this means, options can be made to have different settings
+       in different parts of the pattern. Any changes made in one  alternative
+       do  carry  on  into subsequent branches within the same subpattern. For
+       example,
+
+         (a(?i)b|c)
+
+       matches "ab", "aB", "c", and "C", even though  when  matching  "C"  the
+       first  branch  is  abandoned before the option setting. This is because
+       the effects of option settings happen at compile time. There  would  be
+       some very weird behaviour otherwise.
+
+       The  PCRE-specific  options PCRE_UNGREEDY and PCRE_EXTRA can be changed
+       in the same way as the Perl-compatible options by using the  characters
+       U  and X respectively. The (?X) flag setting is special in that it must
+       always occur earlier in the pattern than any of the additional features
+       it  turns on, even when it is at top level. It is best to put it at the
+       start.
+
+
+SUBPATTERNS
+
+       Subpatterns are delimited by parentheses (round brackets), which can be
+       nested.  Turning part of a pattern into a subpattern does two things:
+
+       1. It localizes a set of alternatives. For example, the pattern
+
+         cat(aract|erpillar|)
+
+       matches  one  of the words "cat", "cataract", or "caterpillar". Without
+       the parentheses, it would match "cataract",  "erpillar"  or  the  empty
+       string.
+
+       2.  It  sets  up  the  subpattern as a capturing subpattern. This means
+       that, when the whole pattern  matches,  that  portion  of  the  subject
+       string that matched the subpattern is passed back to the caller via the
+       ovector argument of pcre_exec(). Opening parentheses are  counted  from
+       left  to  right  (starting  from 1) to obtain numbers for the capturing
+       subpatterns.
+
+       For example, if the string "the red king" is matched against  the  pat-
+       tern
+
+         the ((red|white) (king|queen))
+
+       the captured substrings are "red king", "red", and "king", and are num-
+       bered 1, 2, and 3, respectively.
+
+       The fact that plain parentheses fulfil  two  functions  is  not  always
+       helpful.   There are often times when a grouping subpattern is required
+       without a capturing requirement. If an opening parenthesis is  followed
+       by  a question mark and a colon, the subpattern does not do any captur-
+       ing, and is not counted when computing the  number  of  any  subsequent
+       capturing  subpatterns. For example, if the string "the white queen" is
+       matched against the pattern
+
+         the ((?:red|white) (king|queen))
+
+       the captured substrings are "white queen" and "queen", and are numbered
+       1  and 2. The maximum number of capturing subpatterns is 65535, and the
+       maximum depth of nesting of all subpatterns, both  capturing  and  non-
+       capturing, is 200.
+
+       As  a  convenient shorthand, if any option settings are required at the
+       start of a non-capturing subpattern,  the  option  letters  may  appear
+       between the "?" and the ":". Thus the two patterns
+
+         (?i:saturday|sunday)
+         (?:(?i)saturday|sunday)
+
+       match exactly the same set of strings. Because alternative branches are
+       tried from left to right, and options are not reset until  the  end  of
+       the  subpattern is reached, an option setting in one branch does affect
+       subsequent branches, so the above patterns match "SUNDAY"  as  well  as
+       "Saturday".
+
+
+NAMED SUBPATTERNS
+
+       Identifying  capturing  parentheses  by number is simple, but it can be
+       very hard to keep track of the numbers in complicated  regular  expres-
+       sions.  Furthermore,  if  an  expression  is  modified, the numbers may
+       change. To help with this difficulty, PCRE supports the naming of  sub-
+       patterns,  something  that  Perl  does  not  provide. The Python syntax
+       (?P<name>...) is used. Names consist  of  alphanumeric  characters  and
+       underscores, and must be unique within a pattern.
+
+       Named  capturing  parentheses  are  still  allocated numbers as well as
+       names. The PCRE API provides function calls for extracting the name-to-
+       number  translation table from a compiled pattern. There is also a con-
+       venience function for extracting a captured substring by name. For fur-
+       ther details see the pcreapi documentation.
+
+
+REPETITION
+
+       Repetition  is  specified  by  quantifiers, which can follow any of the
+       following items:
+
+         a literal data character
+         the . metacharacter
+         the \C escape sequence
+         the \X escape sequence (in UTF-8 mode with Unicode properties)
+         an escape such as \d that matches a single character
+         a character class
+         a back reference (see next section)
+         a parenthesized subpattern (unless it is an assertion)
+
+       The general repetition quantifier specifies a minimum and maximum  num-
+       ber  of  permitted matches, by giving the two numbers in curly brackets
+       (braces), separated by a comma. The numbers must be  less  than  65536,
+       and the first must be less than or equal to the second. For example:
+
+         z{2,4}
+
+       matches  "zz",  "zzz",  or  "zzzz". A closing brace on its own is not a
+       special character. If the second number is omitted, but  the  comma  is
+       present,  there  is  no upper limit; if the second number and the comma
+       are both omitted, the quantifier specifies an exact number of  required
+       matches. Thus
+
+         [aeiou]{3,}
+
+       matches at least 3 successive vowels, but may match many more, while
+
+         \d{8}
+
+       matches  exactly  8  digits. An opening curly bracket that appears in a
+       position where a quantifier is not allowed, or one that does not  match
+       the  syntax of a quantifier, is taken as a literal character. For exam-
+       ple, {,6} is not a quantifier, but a literal string of four characters.
+
+       In  UTF-8  mode,  quantifiers  apply to UTF-8 characters rather than to
+       individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char-
+       acters, each of which is represented by a two-byte sequence. Similarly,
+       when Unicode property support is available, \X{3} matches three Unicode
+       extended  sequences,  each of which may be several bytes long (and they
+       may be of different lengths).
+
+       The quantifier {0} is permitted, causing the expression to behave as if
+       the previous item and the quantifier were not present.
+
+       For  convenience  (and  historical compatibility) the three most common
+       quantifiers have single-character abbreviations:
+
+         *    is equivalent to {0,}
+         +    is equivalent to {1,}
+         ?    is equivalent to {0,1}
+
+       It is possible to construct infinite loops by  following  a  subpattern
+       that can match no characters with a quantifier that has no upper limit,
+       for example:
+
+         (a?)*
+
+       Earlier versions of Perl and PCRE used to give an error at compile time
+       for  such  patterns. However, because there are cases where this can be
+       useful, such patterns are now accepted, but if any  repetition  of  the
+       subpattern  does in fact match no characters, the loop is forcibly bro-
+       ken.
+
+       By default, the quantifiers are "greedy", that is, they match  as  much
+       as  possible  (up  to  the  maximum number of permitted times), without
+       causing the rest of the pattern to fail. The classic example  of  where
+       this gives problems is in trying to match comments in C programs. These
+       appear between /* and */ and within the comment,  individual  *  and  /
+       characters  may  appear. An attempt to match C comments by applying the
+       pattern
+
+         /\*.*\*/
+
+       to the string
+
+         /* first comment */  not comment  /* second comment */
+
+       fails, because it matches the entire string owing to the greediness  of
+       the .*  item.
+
+       However,  if  a quantifier is followed by a question mark, it ceases to
+       be greedy, and instead matches the minimum number of times possible, so
+       the pattern
+
+         /\*.*?\*/
+
+       does  the  right  thing with the C comments. The meaning of the various
+       quantifiers is not otherwise changed,  just  the  preferred  number  of
+       matches.   Do  not  confuse this use of question mark with its use as a
+       quantifier in its own right. Because it has two uses, it can  sometimes
+       appear doubled, as in
+
+         \d??\d
+
+       which matches one digit by preference, but can match two if that is the
+       only way the rest of the pattern matches.
+
+       If the PCRE_UNGREEDY option is set (an option which is not available in
+       Perl),  the  quantifiers are not greedy by default, but individual ones
+       can be made greedy by following them with a  question  mark.  In  other
+       words, it inverts the default behaviour.
+
+       When  a  parenthesized  subpattern  is quantified with a minimum repeat
+       count that is greater than 1 or with a limited maximum, more memory  is
+       required  for  the  compiled  pattern, in proportion to the size of the
+       minimum or maximum.
+
+       If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv-
+       alent  to Perl's /s) is set, thus allowing the . to match newlines, the
+       pattern is implicitly anchored, because whatever follows will be  tried
+       against  every character position in the subject string, so there is no
+       point in retrying the overall match at any position  after  the  first.
+       PCRE normally treats such a pattern as though it were preceded by \A.
+
+       In  cases  where  it  is known that the subject string contains no new-
+       lines, it is worth setting PCRE_DOTALL in order to  obtain  this  opti-
+       mization, or alternatively using ^ to indicate anchoring explicitly.
+
+       However,  there is one situation where the optimization cannot be used.
+       When .*  is inside capturing parentheses that  are  the  subject  of  a
+       backreference  elsewhere in the pattern, a match at the start may fail,
+       and a later one succeed. Consider, for example:
+
+         (.*)abc\1
+
+       If the subject is "xyz123abc123" the match point is the fourth  charac-
+       ter. For this reason, such a pattern is not implicitly anchored.
+
+       When a capturing subpattern is repeated, the value captured is the sub-
+       string that matched the final iteration. For example, after
+
+         (tweedle[dume]{3}\s*)+
+
+       has matched "tweedledum tweedledee" the value of the captured substring
+       is  "tweedledee".  However,  if there are nested capturing subpatterns,
+       the corresponding captured values may have been set in previous  itera-
+       tions. For example, after
+
+         /(a|(b))+/
+
+       matches "aba" the value of the second captured substring is "b".
+
+
+ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
+
+       With both maximizing and minimizing repetition, failure of what follows
+       normally causes the repeated item to be re-evaluated to see if  a  dif-
+       ferent number of repeats allows the rest of the pattern to match. Some-
+       times it is useful to prevent this, either to change the nature of  the
+       match,  or  to  cause it fail earlier than it otherwise might, when the
+       author of the pattern knows there is no point in carrying on.
+
+       Consider, for example, the pattern \d+foo when applied to  the  subject
+       line
+
+         123456bar
+
+       After matching all 6 digits and then failing to match "foo", the normal
+       action of the matcher is to try again with only 5 digits  matching  the
+       \d+  item,  and  then  with  4,  and  so on, before ultimately failing.
+       "Atomic grouping" (a term taken from Jeffrey  Friedl's  book)  provides
+       the  means for specifying that once a subpattern has matched, it is not
+       to be re-evaluated in this way.
+
+       If we use atomic grouping for the previous example, the  matcher  would
+       give up immediately on failing to match "foo" the first time. The nota-
+       tion is a kind of special parenthesis, starting with  (?>  as  in  this
+       example:
+
+         (?>\d+)foo
+
+       This  kind  of  parenthesis "locks up" the  part of the pattern it con-
+       tains once it has matched, and a failure further into  the  pattern  is
+       prevented  from  backtracking into it. Backtracking past it to previous
+       items, however, works as normal.
+
+       An alternative description is that a subpattern of  this  type  matches
+       the  string  of  characters  that an identical standalone pattern would
+       match, if anchored at the current point in the subject string.
+
+       Atomic grouping subpatterns are not capturing subpatterns. Simple cases
+       such as the above example can be thought of as a maximizing repeat that
+       must swallow everything it can. So, while both \d+ and  \d+?  are  pre-
+       pared  to  adjust  the number of digits they match in order to make the
+       rest of the pattern match, (?>\d+) can only match an entire sequence of
+       digits.
+
+       Atomic  groups in general can of course contain arbitrarily complicated
+       subpatterns, and can be nested. However, when  the  subpattern  for  an
+       atomic group is just a single repeated item, as in the example above, a
+       simpler notation, called a "possessive quantifier" can  be  used.  This
+       consists  of  an  additional  + character following a quantifier. Using
+       this notation, the previous example can be rewritten as
+
+         \d++foo
+
+       Possessive  quantifiers  are  always  greedy;  the   setting   of   the
+       PCRE_UNGREEDY option is ignored. They are a convenient notation for the
+       simpler forms of atomic group. However, there is no difference  in  the
+       meaning  or  processing  of  a possessive quantifier and the equivalent
+       atomic group.
+
+       The possessive quantifier syntax is an extension to the Perl syntax. It
+       originates in Sun's Java package.
+
+       When  a  pattern  contains an unlimited repeat inside a subpattern that
+       can itself be repeated an unlimited number of  times,  the  use  of  an
+       atomic  group  is  the  only way to avoid some failing matches taking a
+       very long time indeed. The pattern
+
+         (\D+|<\d+>)*[!?]
+
+       matches an unlimited number of substrings that either consist  of  non-
+       digits,  or  digits  enclosed in <>, followed by either ! or ?. When it
+       matches, it runs quickly. However, if it is applied to
+
+         aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+
+       it takes a long time before reporting  failure.  This  is  because  the
+       string  can be divided between the internal \D+ repeat and the external
+       * repeat in a large number of ways, and all  have  to  be  tried.  (The
+       example  uses  [!?]  rather than a single character at the end, because
+       both PCRE and Perl have an optimization that allows  for  fast  failure
+       when  a single character is used. They remember the last single charac-
+       ter that is required for a match, and fail early if it is  not  present
+       in  the  string.)  If  the pattern is changed so that it uses an atomic
+       group, like this:
+
+         ((?>\D+)|<\d+>)*[!?]
+
+       sequences of non-digits cannot be broken, and failure happens  quickly.
+
+
+BACK REFERENCES
+
+       Outside a character class, a backslash followed by a digit greater than
+       0 (and possibly further digits) is a back reference to a capturing sub-
+       pattern  earlier  (that is, to its left) in the pattern, provided there
+       have been that many previous capturing left parentheses.
+
+       However, if the decimal number following the backslash is less than 10,
+       it  is  always  taken  as a back reference, and causes an error only if
+       there are not that many capturing left parentheses in the  entire  pat-
+       tern.  In  other words, the parentheses that are referenced need not be
+       to the left of the reference for numbers less than 10. See the  subsec-
+       tion  entitled  "Non-printing  characters" above for further details of
+       the handling of digits following a backslash.
+
+       A back reference matches whatever actually matched the  capturing  sub-
+       pattern  in  the  current subject string, rather than anything matching
+       the subpattern itself (see "Subpatterns as subroutines" below for a way
+       of doing that). So the pattern
+
+         (sens|respons)e and \1ibility
+
+       matches  "sense and sensibility" and "response and responsibility", but
+       not "sense and responsibility". If caseful matching is in force at  the
+       time  of the back reference, the case of letters is relevant. For exam-
+       ple,
+
+         ((?i)rah)\s+\1
+
+       matches "rah rah" and "RAH RAH", but not "RAH  rah",  even  though  the
+       original capturing subpattern is matched caselessly.
+
+       Back  references  to named subpatterns use the Python syntax (?P=name).
+       We could rewrite the above example as follows:
+
+         (?<p1>(?i)rah)\s+(?P=p1)
+
+       There may be more than one back reference to the same subpattern. If  a
+       subpattern  has  not actually been used in a particular match, any back
+       references to it always fail. For example, the pattern
+
+         (a|(bc))\2
+
+       always fails if it starts to match "a" rather than "bc". Because  there
+       may  be  many  capturing parentheses in a pattern, all digits following
+       the backslash are taken as part of a potential back  reference  number.
+       If the pattern continues with a digit character, some delimiter must be
+       used to terminate the back reference. If the  PCRE_EXTENDED  option  is
+       set,  this  can  be  whitespace.  Otherwise an empty comment (see "Com-
+       ments" below) can be used.
+
+       A back reference that occurs inside the parentheses to which it  refers
+       fails  when  the subpattern is first used, so, for example, (a\1) never
+       matches.  However, such references can be useful inside  repeated  sub-
+       patterns. For example, the pattern
+
+         (a|b\1)+
+
+       matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
+       ation of the subpattern,  the  back  reference  matches  the  character
+       string  corresponding  to  the previous iteration. In order for this to
+       work, the pattern must be such that the first iteration does  not  need
+       to  match the back reference. This can be done using alternation, as in
+       the example above, or by a quantifier with a minimum of zero.
+
+
+ASSERTIONS
+
+       An assertion is a test on the characters  following  or  preceding  the
+       current  matching  point that does not actually consume any characters.
+       The simple assertions coded as \b, \B, \A, \G, \Z,  \z,  ^  and  $  are
+       described above.
+
+       More  complicated  assertions  are  coded as subpatterns. There are two
+       kinds: those that look ahead of the current  position  in  the  subject
+       string,  and  those  that  look  behind  it. An assertion subpattern is
+       matched in the normal way, except that it does not  cause  the  current
+       matching position to be changed.
+
+       Assertion  subpatterns  are  not  capturing subpatterns, and may not be
+       repeated, because it makes no sense to assert the  same  thing  several
+       times.  If  any kind of assertion contains capturing subpatterns within
+       it, these are counted for the purposes of numbering the capturing  sub-
+       patterns in the whole pattern.  However, substring capturing is carried
+       out only for positive assertions, because it does not  make  sense  for
+       negative assertions.
+
+   Lookahead assertions
+
+       Lookahead assertions start with (?= for positive assertions and (?! for
+       negative assertions. For example,
+
+         \w+(?=;)
+
+       matches a word followed by a semicolon, but does not include the  semi-
+       colon in the match, and
+
+         foo(?!bar)
+
+       matches  any  occurrence  of  "foo" that is not followed by "bar". Note
+       that the apparently similar pattern
+
+         (?!foo)bar
+
+       does not find an occurrence of "bar"  that  is  preceded  by  something
+       other  than "foo"; it finds any occurrence of "bar" whatsoever, because
+       the assertion (?!foo) is always true when the next three characters are
+       "bar". A lookbehind assertion is needed to achieve the other effect.
+
+       If you want to force a matching failure at some point in a pattern, the
+       most convenient way to do it is  with  (?!)  because  an  empty  string
+       always  matches, so an assertion that requires there not to be an empty
+       string must always fail.
+
+   Lookbehind assertions
+
+       Lookbehind assertions start with (?<= for positive assertions and  (?<!
+       for negative assertions. For example,
+
+         (?<!foo)bar
+
+       does  find  an  occurrence  of "bar" that is not preceded by "foo". The
+       contents of a lookbehind assertion are restricted  such  that  all  the
+       strings it matches must have a fixed length. However, if there are sev-
+       eral alternatives, they do not all have to have the same fixed  length.
+       Thus
+
+         (?<=bullock|donkey)
+
+       is permitted, but
+
+         (?<!dogs?|cats?)
+
+       causes  an  error at compile time. Branches that match different length
+       strings are permitted only at the top level of a lookbehind  assertion.
+       This  is  an  extension  compared  with  Perl (at least for 5.8), which
+       requires all branches to match the same length of string. An  assertion
+       such as
+
+         (?<=ab(c|de))
+
+       is  not  permitted,  because  its single top-level branch can match two
+       different lengths, but it is acceptable if rewritten to  use  two  top-
+       level branches:
+
+         (?<=abc|abde)
+
+       The  implementation  of lookbehind assertions is, for each alternative,
+       to temporarily move the current position back by the  fixed  width  and
+       then try to match. If there are insufficient characters before the cur-
+       rent position, the match is deemed to fail.
+
+       PCRE does not allow the \C escape (which matches a single byte in UTF-8
+       mode)  to appear in lookbehind assertions, because it makes it impossi-
+       ble to calculate the length of the lookbehind. The \X escape, which can
+       match different numbers of bytes, is also not permitted.
+
+       Atomic  groups can be used in conjunction with lookbehind assertions to
+       specify efficient matching at the end of the subject string. Consider a
+       simple pattern such as
+
+         abcd$
+
+       when  applied  to  a  long string that does not match. Because matching
+       proceeds from left to right, PCRE will look for each "a" in the subject
+       and  then  see  if what follows matches the rest of the pattern. If the
+       pattern is specified as
+
+         ^.*abcd$
+
+       the initial .* matches the entire string at first, but when this  fails
+       (because there is no following "a"), it backtracks to match all but the
+       last character, then all but the last two characters, and so  on.  Once
+       again  the search for "a" covers the entire string, from right to left,
+       so we are no better off. However, if the pattern is written as
+
+         ^(?>.*)(?<=abcd)
+
+       or, equivalently, using the possessive quantifier syntax,
+
+         ^.*+(?<=abcd)
+
+       there can be no backtracking for the .* item; it  can  match  only  the
+       entire  string.  The subsequent lookbehind assertion does a single test
+       on the last four characters. If it fails, the match fails  immediately.
+       For  long  strings, this approach makes a significant difference to the
+       processing time.
+
+   Using multiple assertions
+
+       Several assertions (of any sort) may occur in succession. For example,
+
+         (?<=\d{3})(?<!999)foo
+
+       matches "foo" preceded by three digits that are not "999". Notice  that
+       each  of  the  assertions is applied independently at the same point in
+       the subject string. First there is a  check  that  the  previous  three
+       characters  are  all  digits,  and  then there is a check that the same
+       three characters are not "999".  This pattern does not match "foo" pre-
+       ceded  by  six  characters,  the first of which are digits and the last
+       three of which are not "999". For example, it  doesn't  match  "123abc-
+       foo". A pattern to do that is
+
+         (?<=\d{3}...)(?<!999)foo
+
+       This  time  the  first assertion looks at the preceding six characters,
+       checking that the first three are digits, and then the second assertion
+       checks that the preceding three characters are not "999".
+
+       Assertions can be nested in any combination. For example,
+
+         (?<=(?<!foo)bar)baz
+
+       matches  an occurrence of "baz" that is preceded by "bar" which in turn
+       is not preceded by "foo", while
+
+         (?<=\d{3}(?!999)...)foo
+
+       is another pattern that matches "foo" preceded by three digits and  any
+       three characters that are not "999".
+
+
+CONDITIONAL SUBPATTERNS
+
+       It  is possible to cause the matching process to obey a subpattern con-
+       ditionally or to choose between two alternative subpatterns,  depending
+       on  the result of an assertion, or whether a previous capturing subpat-
+       tern matched or not. The two possible forms of  conditional  subpattern
+       are
+
+         (?(condition)yes-pattern)
+         (?(condition)yes-pattern|no-pattern)
+
+       If  the  condition is satisfied, the yes-pattern is used; otherwise the
+       no-pattern (if present) is used. If there are more  than  two  alterna-
+       tives in the subpattern, a compile-time error occurs.
+
+       There are three kinds of condition. If the text between the parentheses
+       consists of a sequence of digits, the condition  is  satisfied  if  the
+       capturing  subpattern of that number has previously matched. The number
+       must be greater than zero. Consider the following pattern,  which  con-
+       tains  non-significant white space to make it more readable (assume the
+       PCRE_EXTENDED option) and to divide it into three  parts  for  ease  of
+       discussion:
+
+         ( \( )?    [^()]+    (?(1) \) )
+
+       The  first  part  matches  an optional opening parenthesis, and if that
+       character is present, sets it as the first captured substring. The sec-
+       ond  part  matches one or more characters that are not parentheses. The
+       third part is a conditional subpattern that tests whether the first set
+       of parentheses matched or not. If they did, that is, if subject started
+       with an opening parenthesis, the condition is true, and so the yes-pat-
+       tern  is  executed  and  a  closing parenthesis is required. Otherwise,
+       since no-pattern is not present, the  subpattern  matches  nothing.  In
+       other  words,  this  pattern  matches  a  sequence  of non-parentheses,
+       optionally enclosed in parentheses.
+
+       If the condition is the string (R), it is satisfied if a recursive call
+       to  the pattern or subpattern has been made. At "top level", the condi-
+       tion is false.  This  is  a  PCRE  extension.  Recursive  patterns  are
+       described in the next section.
+
+       If  the  condition  is  not  a sequence of digits or (R), it must be an
+       assertion.  This may be a positive or negative lookahead or  lookbehind
+       assertion.  Consider  this  pattern,  again  containing non-significant
+       white space, and with the two alternatives on the second line:
+
+         (?(?=[^a-z]*[a-z])
+         \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )
+
+       The condition  is  a  positive  lookahead  assertion  that  matches  an
+       optional  sequence of non-letters followed by a letter. In other words,
+       it tests for the presence of at least one letter in the subject.  If  a
+       letter  is found, the subject is matched against the first alternative;
+       otherwise it is  matched  against  the  second.  This  pattern  matches
+       strings  in  one  of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
+       letters and dd are digits.
+
+
+COMMENTS
+
+       The sequence (?# marks the start of a comment that continues up to  the
+       next  closing  parenthesis.  Nested  parentheses are not permitted. The
+       characters that make up a comment play no part in the pattern  matching
+       at all.
+
+       If  the PCRE_EXTENDED option is set, an unescaped # character outside a
+       character class introduces a comment that continues up to the next new-
+       line character in the pattern.
+
+
+RECURSIVE PATTERNS
+
+       Consider  the problem of matching a string in parentheses, allowing for
+       unlimited nested parentheses. Without the use of  recursion,  the  best
+       that  can  be  done  is  to use a pattern that matches up to some fixed
+       depth of nesting. It is not possible to  handle  an  arbitrary  nesting
+       depth.  Perl  provides  a  facility  that allows regular expressions to
+       recurse (amongst other things). It does this by interpolating Perl code
+       in the expression at run time, and the code can refer to the expression
+       itself. A Perl pattern to solve the parentheses problem can be  created
+       like this:
+
+         $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
+
+       The (?p{...}) item interpolates Perl code at run time, and in this case
+       refers recursively to the pattern in which it appears. Obviously,  PCRE
+       cannot  support  the  interpolation  of Perl code. Instead, it supports
+       some special syntax for recursion of the entire pattern, and  also  for
+       individual subpattern recursion.
+
+       The  special item that consists of (? followed by a number greater than
+       zero and a closing parenthesis is a recursive call of the subpattern of
+       the  given  number, provided that it occurs inside that subpattern. (If
+       not, it is a "subroutine" call, which is described  in  the  next  sec-
+       tion.)  The special item (?R) is a recursive call of the entire regular
+       expression.
+
+       For example, this PCRE pattern solves the  nested  parentheses  problem
+       (assume  the  PCRE_EXTENDED  option  is  set  so  that  white  space is
+       ignored):
+
+         \( ( (?>[^()]+) | (?R) )* \)
+
+       First it matches an opening parenthesis. Then it matches any number  of
+       substrings  which  can  either  be  a sequence of non-parentheses, or a
+       recursive match of the pattern itself (that is  a  correctly  parenthe-
+       sized substring).  Finally there is a closing parenthesis.
+
+       If  this  were  part of a larger pattern, you would not want to recurse
+       the entire pattern, so instead you could use this:
+
+         ( \( ( (?>[^()]+) | (?1) )* \) )
+
+       We have put the pattern into parentheses, and caused the  recursion  to
+       refer  to them instead of the whole pattern. In a larger pattern, keep-
+       ing track of parenthesis numbers can be tricky. It may be  more  conve-
+       nient  to use named parentheses instead. For this, PCRE uses (?P>name),
+       which is an extension to the Python syntax that  PCRE  uses  for  named
+       parentheses (Perl does not provide named parentheses). We could rewrite
+       the above example as follows:
+
+         (?P<pn> \( ( (?>[^()]+) | (?P>pn) )* \) )
+
+       This particular example pattern contains nested unlimited repeats,  and
+       so  the  use of atomic grouping for matching strings of non-parentheses
+       is important when applying the pattern to strings that  do  not  match.
+       For example, when this pattern is applied to
+
+         (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
+
+       it  yields "no match" quickly. However, if atomic grouping is not used,
+       the match runs for a very long time indeed because there  are  so  many
+       different  ways  the  + and * repeats can carve up the subject, and all
+       have to be tested before failure can be reported.
+
+       At the end of a match, the values set for any capturing subpatterns are
+       those from the outermost level of the recursion at which the subpattern
+       value is set.  If you want to obtain  intermediate  values,  a  callout
+       function can be used (see the next section and the pcrecallout documen-
+       tation). If the pattern above is matched against
+
+         (ab(cd)ef)
+
+       the value for the capturing parentheses is  "ef",  which  is  the  last
+       value  taken  on at the top level. If additional parentheses are added,
+       giving
+
+         \( ( ( (?>[^()]+) | (?R) )* ) \)
+            ^                        ^
+            ^                        ^
+
+       the string they capture is "ab(cd)ef", the contents of  the  top  level
+       parentheses.  If there are more than 15 capturing parentheses in a pat-
+       tern, PCRE has to obtain extra memory to store data during a recursion,
+       which  it  does  by  using pcre_malloc, freeing it via pcre_free after-
+       wards. If  no  memory  can  be  obtained,  the  match  fails  with  the
+       PCRE_ERROR_NOMEMORY error.
+
+       Do  not  confuse  the (?R) item with the condition (R), which tests for
+       recursion.  Consider this pattern, which matches text in  angle  brack-
+       ets,  allowing for arbitrary nesting. Only digits are allowed in nested
+       brackets (that is, when recursing), whereas any characters are  permit-
+       ted at the outer level.
+
+         < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >
+
+       In  this  pattern, (?(R) is the start of a conditional subpattern, with
+       two different alternatives for the recursive and  non-recursive  cases.
+       The (?R) item is the actual recursive call.
+
+
+SUBPATTERNS AS SUBROUTINES
+
+       If the syntax for a recursive subpattern reference (either by number or
+       by name) is used outside the parentheses to which it refers,  it  oper-
+       ates  like  a  subroutine in a programming language. An earlier example
+       pointed out that the pattern
+
+         (sens|respons)e and \1ibility
+
+       matches "sense and sensibility" and "response and responsibility",  but
+       not "sense and responsibility". If instead the pattern
+
+         (sens|respons)e and (?1)ibility
+
+       is  used, it does match "sense and responsibility" as well as the other
+       two strings. Such references must, however, follow  the  subpattern  to
+       which they refer.
+
+
+CALLOUTS
+
+       Perl has a feature whereby using the sequence (?{...}) causes arbitrary
+       Perl code to be obeyed in the middle of matching a regular  expression.
+       This makes it possible, amongst other things, to extract different sub-
+       strings that match the same pair of parentheses when there is a repeti-
+       tion.
+
+       PCRE provides a similar feature, but of course it cannot obey arbitrary
+       Perl code. The feature is called "callout". The caller of PCRE provides
+       an  external function by putting its entry point in the global variable
+       pcre_callout.  By default, this variable contains NULL, which  disables
+       all calling out.
+
+       Within  a  regular  expression,  (?C) indicates the points at which the
+       external function is to be called. If you want  to  identify  different
+       callout  points, you can put a number less than 256 after the letter C.
+       The default value is zero.  For example, this pattern has  two  callout
+       points:
+
+         (?C1)abc(?C2)def
+
+       If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are
+       automatically installed before each item in the pattern. They  are  all
+       numbered 255.
+
+       During matching, when PCRE reaches a callout point (and pcre_callout is
+       set), the external function is called. It is provided with  the  number
+       of  the callout, the position in the pattern, and, optionally, one item
+       of data originally supplied by the caller of pcre_exec().  The  callout
+       function  may cause matching to proceed, to backtrack, or to fail alto-
+       gether. A complete description of the interface to the callout function
+       is given in the pcrecallout documentation.
+
+Last updated: 09 September 2004
+Copyright (c) 1997-2004 University of Cambridge.
author	Philip Hazel <ph10@hermes.cam.ac.uk>	2004-10-07 15:04:35 +0000
committer	Philip Hazel <ph10@hermes.cam.ac.uk>	2004-10-07 15:04:35 +0000
commit	495ae4b01f36d0d8bb0e34a1d7263c2b8224aa4a (patch)
tree	fcfaa2c623d4f155eef907b50b950b602829a30b /doc/doc-txt/pcrepattern.txt
parent	0756eb3cb50d73a77b486e47528f7cb1bffdb299 (diff)