Diffstat (limited to 'doc')
-rw-r--r-- | doc/doc-txt/pcrepattern.txt | 1832 |
-rw-r--r-- | doc/doc-txt/pcretest.txt | 630 |
2 files changed, 0 insertions, 2462 deletions
diff --git a/doc/doc-txt/pcrepattern.txt b/doc/doc-txt/pcrepattern.txt deleted file mode 100644 index bfc1cab4c..000000000 --- a/doc/doc-txt/pcrepattern.txt +++ /dev/null @@ -1,1832 +0,0 @@ -This file contains the PCRE man page that describes the regular expressions -supported by PCRE version 7.2. Note that not all of the features are relevant -in the context of Exim. In particular, the version of PCRE that is compiled -with Exim does not include UTF-8 support, there is no mechanism for changing -the options with which the PCRE functions are called, and features such as -callout are not accessible. ------------------------------------------------------------------------------ - -PCREPATTERN(3) PCREPATTERN(3) - - -NAME - PCRE - Perl-compatible regular expressions - - -PCRE REGULAR EXPRESSION DETAILS - - The syntax and semantics of the regular expressions supported by PCRE - are described below. Regular expressions are also described in the Perl - documentation and in a number of books, some of which have copious - examples. Jeffrey Friedl's "Mastering Regular Expressions", published - by O'Reilly, covers regular expressions in great detail. This descrip- - tion of PCRE's regular expressions is intended as reference material. - - The original operation of PCRE was on strings of one-byte characters. - However, there is now also support for UTF-8 character strings. To use - this, you must build PCRE to include UTF-8 support, and then call - pcre_compile() with the PCRE_UTF8 option. How this affects pattern - matching is mentioned in several places below. There is also a summary - of UTF-8 features in the section on UTF-8 support in the main pcre - page. - - The remainder of this document discusses the patterns that are sup- - ported by PCRE when its main matching function, pcre_exec(), is used. - From release 6.0, PCRE offers a second matching function, - pcre_dfa_exec(), which matches using a different algorithm that is not - Perl-compatible. 
Some of the features discussed below are not available - when pcre_dfa_exec() is used. The advantages and disadvantages of the - alternative function, and how it differs from the normal function, are - discussed in the pcrematching page. - - -CHARACTERS AND METACHARACTERS - - A regular expression is a pattern that is matched against a subject - string from left to right. Most characters stand for themselves in a - pattern, and match the corresponding characters in the subject. As a - trivial example, the pattern - - The quick brown fox - - matches a portion of a subject string that is identical to itself. When - caseless matching is specified (the PCRE_CASELESS option), letters are - matched independently of case. In UTF-8 mode, PCRE always understands - the concept of case for characters whose values are less than 128, so - caseless matching is always possible. For characters with higher val- - ues, the concept of case is supported if PCRE is compiled with Unicode - property support, but not otherwise. If you want to use caseless - matching for characters 128 and above, you must ensure that PCRE is - compiled with Unicode property support as well as with UTF-8 support. - - The power of regular expressions comes from the ability to include - alternatives and repetitions in the pattern. These are encoded in the - pattern by the use of metacharacters, which do not stand for themselves - but instead are interpreted in some special way. - - There are two different sets of metacharacters: those that are recog- - nized anywhere in the pattern except within square brackets, and those - that are recognized within square brackets. Outside square brackets, - the metacharacters are as follows: - - \ general escape character with several uses - ^ assert start of string (or line, in multiline mode) - $ assert end of string (or line, in multiline mode) - . 
match any character except newline (by default) - [ start character class definition - | start of alternative branch - ( start subpattern - ) end subpattern - ? extends the meaning of ( - also 0 or 1 quantifier - also quantifier minimizer - * 0 or more quantifier - + 1 or more quantifier - also "possessive quantifier" - { start min/max quantifier - - Part of a pattern that is in square brackets is called a "character - class". In a character class the only metacharacters are: - - \ general escape character - ^ negate the class, but only if the first character - - indicates character range - [ POSIX character class (only if followed by POSIX - syntax) - ] terminates the character class - - The following sections describe the use of each of the metacharacters. - - -BACKSLASH - - The backslash character has several uses. Firstly, if it is followed by - a non-alphanumeric character, it takes away any special meaning that - character may have. This use of backslash as an escape character - applies both inside and outside character classes. - - For example, if you want to match a * character, you write \* in the - pattern. This escaping action applies whether or not the following - character would otherwise be interpreted as a metacharacter, so it is - always safe to precede a non-alphanumeric with backslash to specify - that it stands for itself. In particular, if you want to match a back- - slash, you write \\. - - If a pattern is compiled with the PCRE_EXTENDED option, whitespace in - the pattern (other than in a character class) and characters between a - # outside a character class and the next newline are ignored. An escap- - ing backslash can be used to include a whitespace or # character as - part of the pattern. - - If you want to remove the special meaning from a sequence of charac- - ters, you can do so by putting them between \Q and \E. 
This is differ- - ent from Perl in that $ and @ are handled as literals in \Q...\E - sequences in PCRE, whereas in Perl, $ and @ cause variable interpola- - tion. Note the following examples: - - Pattern PCRE matches Perl matches - - \Qabc$xyz\E abc$xyz abc followed by the - contents of $xyz - \Qabc\$xyz\E abc\$xyz abc\$xyz - \Qabc\E\$\Qxyz\E abc$xyz abc$xyz - - The \Q...\E sequence is recognized both inside and outside character - classes. - - Non-printing characters - - A second use of backslash provides a way of encoding non-printing char- - acters in patterns in a visible manner. There is no restriction on the - appearance of non-printing characters, apart from the binary zero that - terminates a pattern, but when a pattern is being prepared by text - editing, it is usually easier to use one of the following escape - sequences than the binary character it represents: - - \a alarm, that is, the BEL character (hex 07) - \cx "control-x", where x is any character - \e escape (hex 1B) - \f formfeed (hex 0C) - \n newline (hex 0A) - \r carriage return (hex 0D) - \t tab (hex 09) - \ddd character with octal code ddd, or backreference - \xhh character with hex code hh - \x{hhh..} character with hex code hhh.. - - The precise effect of \cx is as follows: if x is a lower case letter, - it is converted to upper case. Then bit 6 of the character (hex 40) is - inverted. Thus \cz becomes hex 1A, but \c{ becomes hex 3B, while \c; - becomes hex 7B. - - After \x, from zero to two hexadecimal digits are read (letters can be - in upper or lower case). Any number of hexadecimal digits may appear - between \x{ and }, but the value of the character code must be less - than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode (that is, - the maximum hexadecimal value is 7FFFFFFF). If characters other than - hexadecimal digits appear between \x{ and }, or if there is no termi- - nating }, this form of escape is not recognized. 
Instead, the initial - \x will be interpreted as a basic hexadecimal escape, with no following - digits, giving a character whose value is zero. - - Characters whose value is less than 256 can be defined by either of the - two syntaxes for \x. There is no difference in the way they are han- - dled. For example, \xdc is exactly the same as \x{dc}. - - After \0 up to two further octal digits are read. If there are fewer - than two digits, just those that are present are used. Thus the - sequence \0\x\07 specifies two binary zeros followed by a BEL character - (code value 7). Make sure you supply two digits after the initial zero - if the pattern character that follows is itself an octal digit. - - The handling of a backslash followed by a digit other than 0 is compli- - cated. Outside a character class, PCRE reads it and any following dig- - its as a decimal number. If the number is less than 10, or if there - have been at least that many previous capturing left parentheses in the - expression, the entire sequence is taken as a back reference. A - description of how this works is given later, following the discussion - of parenthesized subpatterns. - - Inside a character class, or if the decimal number is greater than 9 - and there have not been that many capturing subpatterns, PCRE re-reads - up to three octal digits following the backslash, and uses them to gen- - erate a data character. Any subsequent digits stand for themselves. In - non-UTF-8 mode, the value of a character specified in octal must be - less than \400. In UTF-8 mode, values up to \777 are permitted. 
For - example: - - \040 is another way of writing a space - \40 is the same, provided there are fewer than 40 - previous capturing subpatterns - \7 is always a back reference - \11 might be a back reference, or another way of - writing a tab - \011 is always a tab - \0113 is a tab followed by the character "3" - \113 might be a back reference, otherwise the - character with octal code 113 - \377 might be a back reference, otherwise - the byte consisting entirely of 1 bits - \81 is either a back reference, or a binary zero - followed by the two characters "8" and "1" - - Note that octal values of 100 or greater must not be introduced by a - leading zero, because no more than three octal digits are ever read. - - All the sequences that define a single character value can be used both - inside and outside character classes. In addition, inside a character - class, the sequence \b is interpreted as the backspace character (hex - 08), and the sequences \R and \X are interpreted as the characters "R" - and "X", respectively. Outside a character class, these sequences have - different meanings (see below). - - Absolute and relative back references - - The sequence \g followed by a positive or negative number, optionally - enclosed in braces, is an absolute or relative back reference. A named - back reference can be coded as \g{name}. Back references are discussed - later, following the discussion of parenthesized subpatterns. - - Generic character types - - Another use of backslash is for specifying generic character types. 
The - following are always recognized: - - \d any decimal digit - \D any character that is not a decimal digit - \h any horizontal whitespace character - \H any character that is not a horizontal whitespace character - \s any whitespace character - \S any character that is not a whitespace character - \v any vertical whitespace character - \V any character that is not a vertical whitespace character - \w any "word" character - \W any "non-word" character - - Each pair of escape sequences partitions the complete set of characters - into two disjoint sets. Any given character matches one, and only one, - of each pair. - - These character type sequences can appear both inside and outside char- - acter classes. They each match one character of the appropriate type. - If the current matching point is at the end of the subject string, all - of them fail, since there is no character to match. - - For compatibility with Perl, \s does not match the VT character (code - 11). This makes it different from the POSIX "space" class. The \s - characters are HT (9), LF (10), FF (12), CR (13), and space (32). If - "use locale;" is included in a Perl script, \s may match the VT charac- - ter. In PCRE, it never does. - - In UTF-8 mode, characters with values greater than 128 never match \d, - \s, or \w, and always match \D, \S, and \W. This is true even when Uni- - code character property support is available. These sequences retain - their original meanings from before UTF-8 support was available, mainly - for efficiency reasons. - - The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to - the other sequences, these do match certain high-valued codepoints in - UTF-8 mode. 
The horizontal space characters are: - - U+0009 Horizontal tab - U+0020 Space - U+00A0 Non-break space - U+1680 Ogham space mark - U+180E Mongolian vowel separator - U+2000 En quad - U+2001 Em quad - U+2002 En space - U+2003 Em space - U+2004 Three-per-em space - U+2005 Four-per-em space - U+2006 Six-per-em space - U+2007 Figure space - U+2008 Punctuation space - U+2009 Thin space - U+200A Hair space - U+202F Narrow no-break space - U+205F Medium mathematical space - U+3000 Ideographic space - - The vertical space characters are: - - U+000A Linefeed - U+000B Vertical tab - U+000C Formfeed - U+000D Carriage return - U+0085 Next line - U+2028 Line separator - U+2029 Paragraph separator - - A "word" character is an underscore or any character less than 256 that - is a letter or digit. The definition of letters and digits is con- - trolled by PCRE's low-valued character tables, and may vary if locale- - specific matching is taking place (see "Locale support" in the pcreapi - page). For example, in a French locale such as "fr_FR" in Unix-like - systems, or "french" in Windows, some character codes greater than 128 - are used for accented letters, and these are matched by \w. The use of - locales with Unicode is discouraged. - - Newline sequences - - Outside a character class, the escape sequence \R matches any Unicode - newline sequence. This is a Perl 5.10 feature. In non-UTF-8 mode \R is - equivalent to the following: - - (?>\r\n|\n|\x0b|\f|\r|\x85) - - This is an example of an "atomic group", details of which are given - below. This particular group matches either the two-character sequence - CR followed by LF, or one of the single characters LF (linefeed, - U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage - return, U+000D), or NEL (next line, U+0085). The two-character sequence - is treated as a single unit that cannot be split. 
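The non-UTF-8 expansion of \R quoted above can be tried out with a short Python sketch. Python's re module is used only as a stand-in: it is not PCRE and has no \R escape. The sketch also swaps the atomic group for a plain non-capturing group, which is enough here because nothing follows the group that could force backtracking into a CRLF pair. The names NEWLINE and split_lines are hypothetical.

```python
import re

# Assumption: Python's re is a stand-in for PCRE; it has no \R escape.
# The alternation mirrors the non-UTF-8 expansion quoted above, with a
# plain non-capturing group (?:...) in place of the atomic group (?>...).
# CRLF is listed first so that the pair is consumed as a single unit.
NEWLINE = re.compile(r"(?:\r\n|\n|\x0b|\f|\r|\x85)")


def split_lines(text):
    """Split text at every newline sequence matched by NEWLINE."""
    return NEWLINE.split(text)


# One subject containing CRLF, LF, VT, FF, CR and NEL in turn.
parts = split_lines("a\r\nb\nc\x0bd\fe\rf\x85g")
print(parts)
```

Because the two-character alternative comes first, a CRLF pair produces one split point rather than two.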
- - In UTF-8 mode, two additional characters whose codepoints are greater - than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa- - rator, U+2029). Unicode character property support is not needed for - these characters to be recognized. - - Inside a character class, \R matches the letter "R". - - Unicode character properties - - When PCRE is built with Unicode character property support, three addi- - tional escape sequences that match characters with specific properties - are available. When not in UTF-8 mode, these sequences are of course - limited to testing characters whose codepoints are less than 256, but - they do work in this mode. The extra escape sequences are: - - \p{xx} a character with the xx property - \P{xx} a character without the xx property - \X an extended Unicode sequence - - The property names represented by xx above are limited to the Unicode - script names, the general category properties, and "Any", which matches - any character (including newline). Other properties such as "InMusical- - Symbols" are not currently supported by PCRE. Note that \P{Any} does - not match any characters, so always causes a match failure. - - Sets of Unicode characters are defined as belonging to certain scripts. - A character from one of these sets can be matched using a script name. - For example: - - \p{Greek} - \P{Han} - - Those that are not part of an identified script are lumped together as - "Common". 
The current list of scripts is: - - Arabic, Armenian, Balinese, Bengali, Bopomofo, Braille, Buginese, - Buhid, Canadian_Aboriginal, Cherokee, Common, Coptic, Cuneiform, - Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic, - Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hira- - gana, Inherited, Kannada, Katakana, Kharoshthi, Khmer, Lao, Latin, - Limbu, Linear_B, Malayalam, Mongolian, Myanmar, New_Tai_Lue, Nko, - Ogham, Old_Italic, Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician, - Runic, Shavian, Sinhala, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, - Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Yi. - - Each character has exactly one general category property, specified by - a two-letter abbreviation. For compatibility with Perl, negation can be - specified by including a circumflex between the opening brace and the - property name. For example, \p{^Lu} is the same as \P{Lu}. - - If only one letter is specified with \p or \P, it includes all the gen- - eral category properties that start with that letter. 
In this case, in - the absence of negation, the curly brackets in the escape sequence are - optional; these two examples have the same effect: - - \p{L} - \pL - - The following general category property codes are supported: - - C Other - Cc Control - Cf Format - Cn Unassigned - Co Private use - Cs Surrogate - - L Letter - Ll Lower case letter - Lm Modifier letter - Lo Other letter - Lt Title case letter - Lu Upper case letter - - M Mark - Mc Spacing mark - Me Enclosing mark - Mn Non-spacing mark - - N Number - Nd Decimal number - Nl Letter number - No Other number - - P Punctuation - Pc Connector punctuation - Pd Dash punctuation - Pe Close punctuation - Pf Final punctuation - Pi Initial punctuation - Po Other punctuation - Ps Open punctuation - - S Symbol - Sc Currency symbol - Sk Modifier symbol - Sm Mathematical symbol - So Other symbol - - Z Separator - Zl Line separator - Zp Paragraph separator - Zs Space separator - - The special property L& is also supported: it matches a character that - has the Lu, Ll, or Lt property, in other words, a letter that is not - classified as a modifier or "other". - - The long synonyms for these properties that Perl supports (such as - \p{Letter}) are not supported by PCRE, nor is it permitted to prefix - any of these properties with "Is". - - No character that is in the Unicode table has the Cn (unassigned) prop- - erty. Instead, this property is assumed for any code point that is not - in the Unicode table. - - Specifying caseless matching does not affect these escape sequences. - For example, \p{Lu} always matches only upper case letters. - - The \X escape matches any number of Unicode characters that form an - extended Unicode sequence. \X is equivalent to - - (?>\PM\pM*) - - That is, it matches a character without the "mark" property, followed - by zero or more characters with the "mark" property, and treats the - sequence as an atomic group (see below). 
Characters with the "mark" - property are typically accents that affect the preceding character. - None of them have codepoints less than 256, so in non-UTF-8 mode \X - matches any one character. - - Matching characters by Unicode property is not fast, because PCRE has - to search a structure that contains data for over fifteen thousand - characters. That is why the traditional escape sequences such as \d and - \w do not use Unicode properties in PCRE. - - Resetting the match start - - The escape sequence \K, which is a Perl 5.10 feature, causes any previ- - ously matched characters not to be included in the final matched - sequence. For example, the pattern: - - foo\Kbar - - matches "foobar", but reports that it has matched "bar". This feature - is similar to a lookbehind assertion (described below). However, in - this case, the part of the subject before the real match does not have - to be of fixed length, as lookbehind assertions do. The use of \K does - not interfere with the setting of captured substrings. For example, - when the pattern - - (foo)\Kbar - - matches "foobar", the first substring is still set to "foo". - - Simple assertions - - The final use of backslash is for certain simple assertions. An asser- - tion specifies a condition that has to be met at a particular point in - a match, without consuming any characters from the subject string. The - use of subpatterns for more complicated assertions is described below. - The backslashed assertions are: - - \b matches at a word boundary - \B matches when not at a word boundary - \A matches at the start of the subject - \Z matches at the end of the subject - also matches before a newline at the end of the subject - \z matches only at the end of the subject - \G matches at the first matching position in the subject - - These assertions may not appear in character classes (but note that \b - has a different meaning, namely the backspace character, inside a char- - acter class). 
- - A word boundary is a position in the subject string where the current - character and the previous character do not both match \w or \W (i.e. - one matches \w and the other matches \W), or the start or end of the - string if the first or last character matches \w, respectively. - - The \A, \Z, and \z assertions differ from the traditional circumflex - and dollar (described in the next section) in that they only ever match - at the very start and end of the subject string, whatever options are - set. Thus, they are independent of multiline mode. These three asser- - tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which - affect only the behaviour of the circumflex and dollar metacharacters. - However, if the startoffset argument of pcre_exec() is non-zero, indi- - cating that matching is to start at a point other than the beginning of - the subject, \A can never match. The difference between \Z and \z is - that \Z matches before a newline at the end of the string as well as at - the very end, whereas \z matches only at the end. - - The \G assertion is true only when the current matching position is at - the start point of the match, as specified by the startoffset argument - of pcre_exec(). It differs from \A when the value of startoffset is - non-zero. By calling pcre_exec() multiple times with appropriate argu- - ments, you can mimic Perl's /g option, and it is in this kind of imple- - mentation where \G can be useful. - - Note, however, that PCRE's interpretation of \G, as the start of the - current match, is subtly different from Perl's, which defines it as the - end of the previous match. In Perl, these can be different when the - previously matched string was empty. Because PCRE does just one match - at a time, it cannot reproduce this behaviour. - - If all the alternatives of a pattern begin with \G, the expression is - anchored to the starting match position, and the "anchored" flag is set - in the compiled regular expression. 
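The difference between \Z and \z, and the role of \G in repeated matching, can be sketched in Python. This is an approximation under stated assumptions, not PCRE itself: Python's \Z behaves like PCRE's \z, so PCRE's \Z is emulated with a lookahead, and \G is emulated by passing a start position to match(), which only succeeds exactly at that position. All names below are hypothetical.

```python
import re

# Assumption: Python's re is a stand-in for PCRE. Python's \Z matches only
# at the very end of the string, which is what PCRE calls \z.
PCRE_BIG_Z = r"(?=\n?\Z)"   # very end, or just before one final newline
PCRE_SMALL_Z = r"\Z"        # very end only

WORD = re.compile(r"\w+\s*")


def ends_with_abc(subject, end_assertion):
    """Return True if 'abc' occurs immediately before the end assertion."""
    return re.search("abc" + end_assertion, subject) is not None


def consecutive_words(subject, start=0):
    """Collect words that follow one another with no gap, starting at start.

    match(subject, pos) must succeed exactly at pos, which plays the part
    that the \\G assertion plays when pcre_exec() is called repeatedly.
    """
    words, pos = [], start
    while True:
        m = WORD.match(subject, pos)
        if m is None:
            break
        words.append(m.group().strip())
        pos = m.end()
    return words


print(ends_with_abc("abc\n", PCRE_BIG_Z), ends_with_abc("abc\n", PCRE_SMALL_Z))
print(consecutive_words("one two, three"))
```

The loop stops at the comma because the next attempt is pinned to the position where the previous match ended, rather than being allowed to search ahead.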
- - -CIRCUMFLEX AND DOLLAR - - Outside a character class, in the default matching mode, the circumflex - character is an assertion that is true only if the current matching - point is at the start of the subject string. If the startoffset argu- - ment of pcre_exec() is non-zero, circumflex can never match if the - PCRE_MULTILINE option is unset. Inside a character class, circumflex - has an entirely different meaning (see below). - - Circumflex need not be the first character of the pattern if a number - of alternatives are involved, but it should be the first thing in each - alternative in which it appears if the pattern is ever to match that - branch. If all possible alternatives start with a circumflex, that is, - if the pattern is constrained to match only at the start of the sub- - ject, it is said to be an "anchored" pattern. (There are also other - constructs that can cause a pattern to be anchored.) - - A dollar character is an assertion that is true only if the current - matching point is at the end of the subject string, or immediately - before a newline at the end of the string (by default). Dollar need not - be the last character of the pattern if a number of alternatives are - involved, but it should be the last item in any branch in which it - appears. Dollar has no special meaning in a character class. - - The meaning of dollar can be changed so that it matches only at the - very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at - compile time. This does not affect the \Z assertion. - - The meanings of the circumflex and dollar characters are changed if the - PCRE_MULTILINE option is set. When this is the case, a circumflex - matches immediately after internal newlines as well as at the start of - the subject string. It does not match after a newline that ends the - string. A dollar matches before any newlines in the string, as well as - at the very end, when PCRE_MULTILINE is set. 
When newline is specified - as the two-character sequence CRLF, isolated CR and LF characters do - not indicate newlines. - - For example, the pattern /^abc$/ matches the subject string "def\nabc" - (where \n represents a newline) in multiline mode, but not otherwise. - Consequently, patterns that are anchored in single line mode because - all branches start with ^ are not anchored in multiline mode, and a - match for circumflex is possible when the startoffset argument of - pcre_exec() is non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if - PCRE_MULTILINE is set. - - Note that the sequences \A, \Z, and \z can be used to match the start - and end of the subject in both modes, and if all branches of a pattern - start with \A it is always anchored, whether or not PCRE_MULTILINE is - set. - - -FULL STOP (PERIOD, DOT) - - Outside a character class, a dot in the pattern matches any one charac- - ter in the subject string except (by default) a character that signi- - fies the end of a line. In UTF-8 mode, the matched character may be - more than one byte long. - - When a line ending is defined as a single character, dot never matches - that character; when the two-character sequence CRLF is used, dot does - not match CR if it is immediately followed by LF, but otherwise it - matches all characters (including isolated CRs and LFs). When any Uni- - code line endings are being recognized, dot does not match CR or LF or - any of the other line ending characters. - - The behaviour of dot with regard to newlines can be changed. If the - PCRE_DOTALL option is set, a dot matches any one character, without - exception. If the two-character sequence CRLF is present in the subject - string, it takes two dots to match it. - - The handling of dot is entirely independent of the handling of circum- - flex and dollar, the only relationship being that they both involve - newlines. Dot has no special meaning in a character class. 
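The /^abc$/ example and the effect of a "dot matches everything" option can be checked with a short Python sketch. Python's re module is a stand-in here, and its re.MULTILINE and re.DOTALL flags are assumed to correspond to PCRE_MULTILINE and PCRE_DOTALL only for these simple cases with LF as the newline.

```python
import re

subject = "def\nabc"

# Without multiline mode, ^ matches only at the start of the subject,
# so the pattern cannot match the second line.
single = re.search(r"^abc$", subject)

# With multiline mode, ^ also matches immediately after the internal newline.
multi = re.search(r"^abc$", subject, re.MULTILINE)

# By default a dot does not match a newline; with DOTALL it matches any character.
dot_default = re.fullmatch(r"a.b", "a\nb")
dot_all = re.fullmatch(r"a.b", "a\nb", re.DOTALL)

print(single, multi.start(), dot_default, dot_all.group() == "a\nb")
```

The two options are independent of each other: the first pair of searches differs only in how the anchors behave, the second pair only in how dot behaves.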
- - -MATCHING A SINGLE BYTE - - Outside a character class, the escape sequence \C matches any one byte, - both in and out of UTF-8 mode. Unlike a dot, it always matches any - line-ending characters. The feature is provided in Perl in order to - match individual bytes in UTF-8 mode. Because it breaks up UTF-8 char- - acters into individual bytes, what remains in the string may be a mal- - formed UTF-8 string. For this reason, the \C escape sequence is best - avoided. - - PCRE does not allow \C to appear in lookbehind assertions (described - below), because in UTF-8 mode this would make it impossible to calcu- - late the length of the lookbehind. - - -SQUARE BRACKETS AND CHARACTER CLASSES - - An opening square bracket introduces a character class, terminated by a - closing square bracket. A closing square bracket on its own is not spe- - cial. If a closing square bracket is required as a member of the class, - it should be the first data character in the class (after an initial - circumflex, if present) or escaped with a backslash. - - A character class matches a single character in the subject. In UTF-8 - mode, the character may occupy more than one byte. A matched character - must be in the set of characters defined by the class, unless the first - character in the class definition is a circumflex, in which case the - subject character must not be in the set defined by the class. If a - circumflex is actually required as a member of the class, ensure it is - not the first character, or escape it with a backslash. - - For example, the character class [aeiou] matches any lower case vowel, - while [^aeiou] matches any character that is not a lower case vowel. - Note that a circumflex is just a convenient notation for specifying the - characters that are in the class by enumerating those that are not. 
A - class that starts with a circumflex is not an assertion: it still con- - sumes a character from the subject string, and therefore it fails if - the current pointer is at the end of the string. - - In UTF-8 mode, characters with values greater than 255 can be included - in a class as a literal string of bytes, or by using the \x{ escaping - mechanism. - - When caseless matching is set, any letters in a class represent both - their upper case and lower case versions, so for example, a caseless - [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not - match "A", whereas a caseful version would. In UTF-8 mode, PCRE always - understands the concept of case for characters whose values are less - than 128, so caseless matching is always possible. For characters with - higher values, the concept of case is supported if PCRE is compiled - with Unicode property support, but not otherwise. If you want to use - caseless matching for characters 128 and above, you must ensure that - PCRE is compiled with Unicode property support as well as with UTF-8 - support. - - Characters that might indicate line breaks are never treated in any - special way when matching character classes, whatever line-ending - sequence is in use, and whatever setting of the PCRE_DOTALL and - PCRE_MULTILINE options is used. A class such as [^a] always matches one - of these characters. - - The minus (hyphen) character can be used to specify a range of charac- - ters in a character class. For example, [d-m] matches any letter - between d and m, inclusive. If a minus character is required in a - class, it must be escaped with a backslash or appear in a position - where it cannot be interpreted as indicating a range, typically as the - first or last character in the class. - - It is not possible to have the literal character "]" as the end charac- - ter of a range. 
A pattern such as [W-]46] is interpreted as a class of - two characters ("W" and "-") followed by a literal string "46]", so it - would match "W46]" or "-46]". However, if the "]" is escaped with a - backslash it is interpreted as the end of range, so [W-\]46] is inter- - preted as a class containing a range followed by two other characters. - The octal or hexadecimal representation of "]" can also be used to end - a range. - - Ranges operate in the collating sequence of character values. They can - also be used for characters specified numerically, for example - [\000-\037]. In UTF-8 mode, ranges can include characters whose values - are greater than 255, for example [\x{100}-\x{2ff}]. - - If a range that includes letters is used when caseless matching is set, - it matches the letters in either case. For example, [W-c] is equivalent - to [][\\^_`wxyzabc], matched caselessly, and in non-UTF-8 mode, if - character tables for a French locale are in use, [\xc8-\xcb] matches - accented E characters in both cases. In UTF-8 mode, PCRE supports the - concept of case for characters with values greater than 128 only when - it is compiled with Unicode property support. - - The character types \d, \D, \p, \P, \s, \S, \w, and \W may also appear - in a character class, and add the characters that they match to the - class. For example, [\dABCDEF] matches any hexadecimal digit. A circum- - flex can conveniently be used with the upper case character types to - specify a more restricted set of characters than the matching lower - case type. For example, the class [^\W_] matches any letter or digit, - but not underscore. - - The only metacharacters that are recognized in character classes are - backslash, hyphen (only where it can be interpreted as specifying a - range), circumflex (only at the start), opening square bracket (only - when it can be interpreted as introducing a POSIX class name - see the - next section), and the terminating closing square bracket. 
However, - escaping other non-alphanumeric characters does no harm. - - -POSIX CHARACTER CLASSES - - Perl supports the POSIX notation for character classes. This uses names - enclosed by [: and :] within the enclosing square brackets. PCRE also - supports this notation. For example, - - [01[:alpha:]%] - - matches "0", "1", any alphabetic character, or "%". The supported class - names are - - alnum letters and digits - alpha letters - ascii character codes 0 - 127 - blank space or tab only - cntrl control characters - digit decimal digits (same as \d) - graph printing characters, excluding space - lower lower case letters - print printing characters, including space - punct printing characters, excluding letters and digits - space white space (not quite the same as \s) - upper upper case letters - word "word" characters (same as \w) - xdigit hexadecimal digits - - The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13), - and space (32). Notice that this list includes the VT character (code - 11). This makes "space" different to \s, which does not include VT (for - Perl compatibility). - - The name "word" is a Perl extension, and "blank" is a GNU extension - from Perl 5.8. Another Perl extension is negation, which is indicated - by a ^ character after the colon. For example, - - [12[:^digit:]] - - matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the - POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but - these are not supported, and an error is given if they are encountered. - - In UTF-8 mode, characters with values greater than 128 do not match any - of the POSIX character classes. - - -VERTICAL BAR - - Vertical bar characters are used to separate alternative patterns. For - example, the pattern - - gilbert|sullivan - - matches either "gilbert" or "sullivan". Any number of alternatives may - appear, and an empty alternative is permitted (matching the empty - string). 
The matching process tries each alternative in turn, from left - to right, and the first one that succeeds is used. If the alternatives - are within a subpattern (defined below), "succeeds" means matching the - rest of the main pattern as well as the alternative in the subpattern. - - -INTERNAL OPTION SETTING - - The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and - PCRE_EXTENDED options can be changed from within the pattern by a - sequence of Perl option letters enclosed between "(?" and ")". The - option letters are - - i for PCRE_CASELESS - m for PCRE_MULTILINE - s for PCRE_DOTALL - x for PCRE_EXTENDED - - For example, (?im) sets caseless, multiline matching. It is also possi- - ble to unset these options by preceding the letter with a hyphen, and a - combined setting and unsetting such as (?im-sx), which sets PCRE_CASE- - LESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED, - is also permitted. If a letter appears both before and after the - hyphen, the option is unset. - - When an option change occurs at top level (that is, not inside subpat- - tern parentheses), the change applies to the remainder of the pattern - that follows. If the change is placed right at the start of a pattern, - PCRE extracts it into the global options (and it will therefore show up - in data extracted by the pcre_fullinfo() function). - - An option change within a subpattern (see below for a description of - subpatterns) affects only that part of the current pattern that follows - it, so - - (a(?i)b)c - - matches abc and aBc and no other strings (assuming PCRE_CASELESS is not - used). By this means, options can be made to have different settings - in different parts of the pattern. Any changes made in one alternative - do carry on into subsequent branches within the same subpattern. For - example, - - (a(?i)b|c) - - matches "ab", "aB", "c", and "C", even though when matching "C" the - first branch is abandoned before the option setting. 
This is because - the effects of option settings happen at compile time. There would be - some very weird behaviour otherwise. - - The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA - can be changed in the same way as the Perl-compatible options by using - the characters J, U and X respectively. - - -SUBPATTERNS - - Subpatterns are delimited by parentheses (round brackets), which can be - nested. Turning part of a pattern into a subpattern does two things: - - 1. It localizes a set of alternatives. For example, the pattern - - cat(aract|erpillar|) - - matches one of the words "cat", "cataract", or "caterpillar". Without - the parentheses, it would match "cataract", "erpillar" or an empty - string. - - 2. It sets up the subpattern as a capturing subpattern. This means - that, when the whole pattern matches, that portion of the subject - string that matched the subpattern is passed back to the caller via the - ovector argument of pcre_exec(). Opening parentheses are counted from - left to right (starting from 1) to obtain numbers for the capturing - subpatterns. - - For example, if the string "the red king" is matched against the pat- - tern - - the ((red|white) (king|queen)) - - the captured substrings are "red king", "red", and "king", and are num- - bered 1, 2, and 3, respectively. - - The fact that plain parentheses fulfil two functions is not always - helpful. There are often times when a grouping subpattern is required - without a capturing requirement. If an opening parenthesis is followed - by a question mark and a colon, the subpattern does not do any captur- - ing, and is not counted when computing the number of any subsequent - capturing subpatterns. For example, if the string "the white queen" is - matched against the pattern - - the ((?:red|white) (king|queen)) - - the captured substrings are "white queen" and "queen", and are numbered - 1 and 2. The maximum number of capturing subpatterns is 65535. 
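The numbering rule for capturing and non-capturing parentheses can be checked with a short sketch. It uses Python's standard re module as a stand-in for PCRE (an assumption: for these two patterns the behaviour is the same, but this is not the pcre_exec() interface described above). The helper name captured_groups is hypothetical.

```python
import re

def captured_groups(pattern, subject):
    """Return the tuple of captured substrings, or None if there is no match."""
    match = re.search(pattern, subject)
    return match.groups() if match else None

# Plain parentheses: three capturing subpatterns, numbered 1, 2, 3.
capturing = captured_groups(r"the ((red|white) (king|queen))", "the red king")

# (?: ... ) groups without capturing, so only two numbered subpatterns remain.
non_capturing = captured_groups(r"the ((?:red|white) (king|queen))", "the white queen")

print(capturing)
print(non_capturing)
```

The position of each substring in the returned tuple, plus one, is its capture number.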
- - As a convenient shorthand, if any option settings are required at the - start of a non-capturing subpattern, the option letters may appear - between the "?" and the ":". Thus the two patterns - - (?i:saturday|sunday) - (?:(?i)saturday|sunday) - - match exactly the same set of strings. Because alternative branches are - tried from left to right, and options are not reset until the end of - the subpattern is reached, an option setting in one branch does affect - subsequent branches, so the above patterns match "SUNDAY" as well as - "Saturday". - - -DUPLICATE SUBPATTERN NUMBERS - - Perl 5.10 introduced a feature whereby each alternative in a subpattern - uses the same numbers for its capturing parentheses. Such a subpattern - starts with (?| and is itself a non-capturing subpattern. For example, - consider this pattern: - - (?|(Sat)ur|(Sun))day - - Because the two alternatives are inside a (?| group, both sets of cap- - turing parentheses are numbered one. Thus, when the pattern matches, - you can look at captured substring number one, whichever alternative - matched. This construct is useful when you want to capture part, but - not all, of one of a number of alternatives. Inside a (?| group, paren- - theses are numbered as usual, but the number is reset at the start of - each branch. The numbers of any capturing buffers that follow the sub- - pattern start after the highest number used in any branch. The follow- - ing example is taken from the Perl documentation. The numbers under- - neath show in which buffer the captured content will be stored. - - # before ---------------branch-reset----------- after - / ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x - # 1 2 2 3 2 3 4 - - A backreference or a recursive call to a numbered subpattern always - refers to the first one in the pattern with the given number. - - An alternative approach to using this "branch reset" feature is to use - duplicate named subpatterns, as described in the next section. 
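To see what the (?| group saves, here is a rough sketch of the same weekday match without it. Python's re module is used as a stand-in and has no branch-reset syntax (that absence is the point of the example), so the two alternatives receive different numbers and the caller has to find the one that was set. The names WEEKEND and first_set_group are hypothetical.

```python
import re

# Without (?|...), the two alternatives get separate group numbers 1 and 2.
WEEKEND = re.compile(r"(?:(Sat)ur|(Sun))day")

def first_set_group(match):
    """Return the first capturing group that took part in the match."""
    return next(g for g in match.groups() if g is not None)

saturday = first_set_group(WEEKEND.fullmatch("Saturday"))
sunday = first_set_group(WEEKEND.fullmatch("Sunday"))
print(saturday, sunday)
```

With a branch-reset group, both alternatives would be numbered one and no such search would be needed.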
- - -NAMED SUBPATTERNS - - Identifying capturing parentheses by number is simple, but it can be - very hard to keep track of the numbers in complicated regular expres- - sions. Furthermore, if an expression is modified, the numbers may - change. To help with this difficulty, PCRE supports the naming of sub- - patterns. This feature was not added to Perl until release 5.10. Python - had the feature earlier, and PCRE introduced it at release 4.0, using - the Python syntax. PCRE now supports both the Perl and the Python syn- - tax. - - In PCRE, a subpattern can be named in one of three ways: (?<name>...) - or (?'name'...) as in Perl, or (?P<name>...) as in Python. References - to capturing parentheses from other parts of the pattern, such as back- - references, recursion, and conditions, can be made by name as well as - by number. - - Names consist of up to 32 alphanumeric characters and underscores. - Named capturing parentheses are still allocated numbers as well as - names, exactly as if the names were not present. The PCRE API provides - function calls for extracting the name-to-number translation table from - a compiled pattern. There is also a convenience function for extracting - a captured substring by name. - - By default, a name must be unique within a pattern, but it is possible - to relax this constraint by setting the PCRE_DUPNAMES option at compile - time. This can be useful for patterns where only one instance of the - named parentheses can match. Suppose you want to match the name of a - weekday, either as a 3-letter abbreviation or as the full name, and in - both cases you want to extract the abbreviation. This pattern (ignoring - the line breaks) does the job: - - (?<DN>Mon|Fri|Sun)(?:day)?| - (?<DN>Tue)(?:sday)?| - (?<DN>Wed)(?:nesday)?| - (?<DN>Thu)(?:rsday)?| - (?<DN>Sat)(?:urday)? - - There are five capturing substrings, but only one is ever set after a - match. 
(An alternative way of solving this problem is to use a "branch - reset" subpattern, as described in the previous section.) - - The convenience function for extracting the data by name returns the - substring for the first (and in this example, the only) subpattern of - that name that matched. This saves searching to find which numbered - subpattern it was. If you make a reference to a non-unique named sub- - pattern from elsewhere in the pattern, the one that corresponds to the - lowest number is used. For further details of the interfaces for han- - dling named subpatterns, see the pcreapi documentation. - - -REPETITION - - Repetition is specified by quantifiers, which can follow any of the - following items: - - a literal data character - the dot metacharacter - the \C escape sequence - the \X escape sequence (in UTF-8 mode with Unicode properties) - the \R escape sequence - an escape such as \d that matches a single character - a character class - a back reference (see next section) - a parenthesized subpattern (unless it is an assertion) - - The general repetition quantifier specifies a minimum and maximum num- - ber of permitted matches, by giving the two numbers in curly brackets - (braces), separated by a comma. The numbers must be less than 65536, - and the first must be less than or equal to the second. For example: - - z{2,4} - - matches "zz", "zzz", or "zzzz". A closing brace on its own is not a - special character. If the second number is omitted, but the comma is - present, there is no upper limit; if the second number and the comma - are both omitted, the quantifier specifies an exact number of required - matches. Thus - - [aeiou]{3,} - - matches at least 3 successive vowels, but may match many more, while - - \d{8} - - matches exactly 8 digits. An opening curly bracket that appears in a - position where a quantifier is not allowed, or one that does not match - the syntax of a quantifier, is taken as a literal character. 
For exam- - ple, {,6} is not a quantifier, but a literal string of four characters. - - In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to - individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char- - acters, each of which is represented by a two-byte sequence. Similarly, - when Unicode property support is available, \X{3} matches three Unicode - extended sequences, each of which may be several bytes long (and they - may be of different lengths). - - The quantifier {0} is permitted, causing the expression to behave as if - the previous item and the quantifier were not present. - - For convenience, the three most common quantifiers have single-charac- - ter abbreviations: - - * is equivalent to {0,} - + is equivalent to {1,} - ? is equivalent to {0,1} - - It is possible to construct infinite loops by following a subpattern - that can match no characters with a quantifier that has no upper limit, - for example: - - (a?)* - - Earlier versions of Perl and PCRE used to give an error at compile time - for such patterns. However, because there are cases where this can be - useful, such patterns are now accepted, but if any repetition of the - subpattern does in fact match no characters, the loop is forcibly bro- - ken. - - By default, the quantifiers are "greedy", that is, they match as much - as possible (up to the maximum number of permitted times), without - causing the rest of the pattern to fail. The classic example of where - this gives problems is in trying to match comments in C programs. These - appear between /* and */ and within the comment, individual * and / - characters may appear. An attempt to match C comments by applying the - pattern - - /\*.*\*/ - - to the string - - /* first comment */ not comment /* second comment */ - - fails, because it matches the entire string owing to the greediness of - the .* item. 
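The greedy failure can be reproduced with a small sketch. It uses Python's re module as a stand-in for PCRE (an assumption; greedy and minimizing quantifiers behave the same way for this pattern). The minimizing form, which is described in the text that follows, is included only for contrast.

```python
import re

subject = "/* first comment */ not comment /* second comment */"

# Greedy .* runs to the last "*/" in the subject, so the whole string matches.
greedy = re.search(r"/\*.*\*/", subject).group(0)

# Minimizing .*? stops at the first "*/" after each "/*".
lazy = re.findall(r"/\*.*?\*/", subject)

print(greedy == subject)
print(lazy)
```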
- - However, if a quantifier is followed by a question mark, it ceases to - be greedy, and instead matches the minimum number of times possible, so - the pattern - - /\*.*?\*/ - - does the right thing with the C comments. The meaning of the various - quantifiers is not otherwise changed, just the preferred number of - matches. Do not confuse this use of question mark with its use as a - quantifier in its own right. Because it has two uses, it can sometimes - appear doubled, as in - - \d??\d - - which matches one digit by preference, but can match two if that is the - only way the rest of the pattern matches. - - If the PCRE_UNGREEDY option is set (an option that is not available in - Perl), the quantifiers are not greedy by default, but individual ones - can be made greedy by following them with a question mark. In other - words, it inverts the default behaviour. - - When a parenthesized subpattern is quantified with a minimum repeat - count that is greater than 1 or with a limited maximum, more memory is - required for the compiled pattern, in proportion to the size of the - minimum or maximum. - - If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv- - alent to Perl's /s) is set, thus allowing the dot to match newlines, - the pattern is implicitly anchored, because whatever follows will be - tried against every character position in the subject string, so there - is no point in retrying the overall match at any position after the - first. PCRE normally treats such a pattern as though it were preceded - by \A. - - In cases where it is known that the subject string contains no new- - lines, it is worth setting PCRE_DOTALL in order to obtain this opti- - mization, or alternatively using ^ to indicate anchoring explicitly. - - However, there is one situation where the optimization cannot be used. 
- When .* is inside capturing parentheses that are the subject of a - backreference elsewhere in the pattern, a match at the start may fail - where a later one succeeds. Consider, for example: - - (.*)abc\1 - - If the subject is "xyz123abc123" the match point is the fourth charac- - ter. For this reason, such a pattern is not implicitly anchored. - - When a capturing subpattern is repeated, the value captured is the sub- - string that matched the final iteration. For example, after - - (tweedle[dume]{3}\s*)+ - - has matched "tweedledum tweedledee" the value of the captured substring - is "tweedledee". However, if there are nested capturing subpatterns, - the corresponding captured values may have been set in previous itera- - tions. For example, after - - /(a|(b))+/ - - matches "aba" the value of the second captured substring is "b". - - -ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS - - With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy") - repetition, failure of what follows normally causes the repeated item - to be re-evaluated to see if a different number of repeats allows the - rest of the pattern to match. Sometimes it is useful to prevent this, - either to change the nature of the match, or to cause it to fail earlier - than it otherwise might, when the author of the pattern knows there is - no point in carrying on. - - Consider, for example, the pattern \d+foo when applied to the subject - line - - 123456bar - - After matching all 6 digits and then failing to match "foo", the normal - action of the matcher is to try again with only 5 digits matching the - \d+ item, and then with 4, and so on, before ultimately failing. - "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides - the means for specifying that once a subpattern has matched, it is not - to be re-evaluated in this way. - - If we use atomic grouping for the previous example, the matcher gives - up immediately on failing to match "foo" the first time.
The notation - is a kind of special parenthesis, starting with (?> as in this example: - - (?>\d+)foo - - This kind of parenthesis "locks up" the part of the pattern it con- - tains once it has matched, and a failure further into the pattern is - prevented from backtracking into it. Backtracking past it to previous - items, however, works as normal. - - An alternative description is that a subpattern of this type matches - the string of characters that an identical standalone pattern would - match, if anchored at the current point in the subject string. - - Atomic grouping subpatterns are not capturing subpatterns. Simple cases - such as the above example can be thought of as a maximizing repeat that - must swallow everything it can. So, while both \d+ and \d+? are pre- - pared to adjust the number of digits they match in order to make the - rest of the pattern match, (?>\d+) can only match an entire sequence of - digits. - - Atomic groups in general can of course contain arbitrarily complicated - subpatterns, and can be nested. However, when the subpattern for an - atomic group is just a single repeated item, as in the example above, a - simpler notation, called a "possessive quantifier" can be used. This - consists of an additional + character following a quantifier. Using - this notation, the previous example can be rewritten as - - \d++foo - - Possessive quantifiers are always greedy; the setting of the - PCRE_UNGREEDY option is ignored. They are a convenient notation for the - simpler forms of atomic group. However, there is no difference in the - meaning of a possessive quantifier and the equivalent atomic group, - though there may be a performance difference; possessive quantifiers - should be slightly faster. - - The possessive quantifier syntax is an extension to the Perl 5.8 syn- - tax. Jeffrey Friedl originated the idea (and the name) in the first - edition of his book. 
Mike McCloskey liked it, so implemented it when he - built Sun's Java package, and PCRE copied it from there. It ultimately - found its way into Perl at release 5.10. - - PCRE has an optimization that automatically "possessifies" certain sim- - ple pattern constructs. For example, the sequence A+B is treated as - A++B because there is no point in backtracking into a sequence of A's - when B must follow. - - When a pattern contains an unlimited repeat inside a subpattern that - can itself be repeated an unlimited number of times, the use of an - atomic group is the only way to avoid some failing matches taking a - very long time indeed. The pattern - - (\D+|<\d+>)*[!?] - - matches an unlimited number of substrings that either consist of non- - digits, or digits enclosed in <>, followed by either ! or ?. When it - matches, it runs quickly. However, if it is applied to - - aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa - - it takes a long time before reporting failure. This is because the - string can be divided between the internal \D+ repeat and the external - * repeat in a large number of ways, and all have to be tried. (The - example uses [!?] rather than a single character at the end, because - both PCRE and Perl have an optimization that allows for fast failure - when a single character is used. They remember the last single charac- - ter that is required for a match, and fail early if it is not present - in the string.) If the pattern is changed so that it uses an atomic - group, like this: - - ((?>\D+)|<\d+>)*[!?] - - sequences of non-digits cannot be broken, and failure happens quickly. - - -BACK REFERENCES - - Outside a character class, a backslash followed by a digit greater than - 0 (and possibly further digits) is a back reference to a capturing sub- - pattern earlier (that is, to its left) in the pattern, provided there - have been that many previous capturing left parentheses. 
- - However, if the decimal number following the backslash is less than 10, - it is always taken as a back reference, and causes an error only if - there are not that many capturing left parentheses in the entire pat- - tern. In other words, the parentheses that are referenced need not be - to the left of the reference for numbers less than 10. A "forward back - reference" of this type can make sense when a repetition is involved - and the subpattern to the right has participated in an earlier itera- - tion. - - It is not possible to have a numerical "forward back reference" to a - subpattern whose number is 10 or more using this syntax because a - sequence such as \50 is interpreted as a character defined in octal. - See the subsection entitled "Non-printing characters" above for further - details of the handling of digits following a backslash. There is no - such problem when named parentheses are used. A back reference to any - subpattern is possible using named parentheses (see below). - - Another way of avoiding the ambiguity inherent in the use of digits - following a backslash is to use the \g escape sequence, which is a fea- - ture introduced in Perl 5.10. This escape must be followed by a posi- - tive or a negative number, optionally enclosed in braces. These exam- - ples are all identical: - - (ring), \1 - (ring), \g1 - (ring), \g{1} - - A positive number specifies an absolute reference without the ambiguity - that is present in the older syntax. It is also useful when literal - digits follow the reference. A negative number is a relative reference. - Consider this example: - - (abc(def)ghi)\g{-1} - - The sequence \g{-1} is a reference to the most recently started captur- - ing subpattern before \g, that is, it is equivalent to \2. Similarly, - \g{-2} would be equivalent to \1. The use of relative references can be - helpful in long patterns, and also in patterns that are created by - joining together fragments that contain references within themselves.
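The arithmetic behind a relative reference can be sketched as follows. Python's re module does not accept \g{-1} inside a pattern, so this example (an assumption-laden illustration, not PCRE's implementation) converts the relative number to an absolute one by hand and then uses the ordinary \N form. The helper name absolute_group is hypothetical.

```python
import re

def absolute_group(groups_opened_so_far, relative):
    """Translate a negative relative reference into an absolute group number."""
    if relative >= 0:
        raise ValueError("expected a negative relative reference")
    number = groups_opened_so_far + relative + 1
    if number < 1:
        raise ValueError("reference points before the first group")
    return number

# In (abc(def)ghi)\g{-1}, two capturing groups have been started before \g.
ref_minus_1 = absolute_group(2, -1)
ref_minus_2 = absolute_group(2, -2)

# Build the absolute equivalent of (abc(def)ghi)\g{-1} for Python's re.
pattern = re.compile("(abc(def)ghi)" + "\\" + str(ref_minus_1))
matches = pattern.fullmatch("abcdefghidef") is not None
print(ref_minus_1, ref_minus_2, matches)
```

Because the result depends only on how many groups precede the reference, a fragment written with relative references keeps working when it is pasted into a larger pattern.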
- - A back reference matches whatever actually matched the capturing sub- - pattern in the current subject string, rather than anything matching - the subpattern itself (see "Subpatterns as subroutines" below for a way - of doing that). So the pattern - - (sens|respons)e and \1ibility - - matches "sense and sensibility" and "response and responsibility", but - not "sense and responsibility". If caseful matching is in force at the - time of the back reference, the case of letters is relevant. For exam- - ple, - - ((?i)rah)\s+\1 - - matches "rah rah" and "RAH RAH", but not "RAH rah", even though the - original capturing subpattern is matched caselessly. - - There are several different ways of writing back references to named - subpatterns. The .NET syntax \k{name} and the Perl syntax \k<name> or - \k'name' are supported, as is the Python syntax (?P=name). Perl 5.10's - unified back reference syntax, in which \g can be used for both numeric - and named references, is also supported. We could rewrite the above - example in any of the following ways: - - (?<p1>(?i)rah)\s+\k<p1> - (?'p1'(?i)rah)\s+\k{p1} - (?P<p1>(?i)rah)\s+(?P=p1) - (?<p1>(?i)rah)\s+\g{p1} - - A subpattern that is referenced by name may appear in the pattern - before or after the reference. - - There may be more than one back reference to the same subpattern. If a - subpattern has not actually been used in a particular match, any back - references to it always fail. For example, the pattern - - (a|(bc))\2 - - always fails if it starts to match "a" rather than "bc". Because there - may be many capturing parentheses in a pattern, all digits following - the backslash are taken as part of a potential back reference number. - If the pattern continues with a digit character, some delimiter must be - used to terminate the back reference. If the PCRE_EXTENDED option is - set, this can be whitespace. Otherwise an empty comment (see "Com- - ments" below) can be used. 
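The two back reference examples above can be tried out with a short sketch in Python's re module, used here as a stand-in for PCRE. One deliberate change is an assumption of this sketch: recent Python versions reject a bare (?i) in the middle of a pattern, so the scoped form (?i:rah) is used instead of (?i)rah. The variable names are hypothetical.

```python
import re

# \1 must repeat the text that group 1 actually matched.
austen = re.compile(r"(sens|respons)e and \1ibility")
titles = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]
austen_results = [austen.fullmatch(t) is not None for t in titles]

# The group matches caselessly, but the back reference itself is caseful.
rah = re.compile(r"((?i:rah))\s+\1")
rah_results = [rah.fullmatch(s) is not None for s in ("rah rah", "RAH RAH", "RAH rah")]

# The same test using the Python named-reference syntax (?P=name).
named = re.compile(r"(?P<p1>(?i:rah))\s+(?P=p1)")
named_results = [named.fullmatch(s) is not None for s in ("rah rah", "RAH RAH", "RAH rah")]

print(austen_results, rah_results, named_results)
```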
- - A back reference that occurs inside the parentheses to which it refers - fails when the subpattern is first used, so, for example, (a\1) never - matches. However, such references can be useful inside repeated sub- - patterns. For example, the pattern - - (a|b\1)+ - - matches any number of "a"s and also "aba", "ababbaa" etc. At each iter- - ation of the subpattern, the back reference matches the character - string corresponding to the previous iteration. In order for this to - work, the pattern must be such that the first iteration does not need - to match the back reference. This can be done using alternation, as in - the example above, or by a quantifier with a minimum of zero. - - -ASSERTIONS - - An assertion is a test on the characters following or preceding the - current matching point that does not actually consume any characters. - The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are - described above. - - More complicated assertions are coded as subpatterns. There are two - kinds: those that look ahead of the current position in the subject - string, and those that look behind it. An assertion subpattern is - matched in the normal way, except that it does not cause the current - matching position to be changed. - - Assertion subpatterns are not capturing subpatterns, and may not be - repeated, because it makes no sense to assert the same thing several - times. If any kind of assertion contains capturing subpatterns within - it, these are counted for the purposes of numbering the capturing sub- - patterns in the whole pattern. However, substring capturing is carried - out only for positive assertions, because it does not make sense for - negative assertions. - - Lookahead assertions - - Lookahead assertions start with (?= for positive assertions and (?! for - negative assertions. 
For example, - - \w+(?=;) - - matches a word followed by a semicolon, but does not include the semi- - colon in the match, and - - foo(?!bar) - - matches any occurrence of "foo" that is not followed by "bar". Note - that the apparently similar pattern - - (?!foo)bar - - does not find an occurrence of "bar" that is preceded by something - other than "foo"; it finds any occurrence of "bar" whatsoever, because - the assertion (?!foo) is always true when the next three characters are - "bar". A lookbehind assertion is needed to achieve the other effect. - - If you want to force a matching failure at some point in a pattern, the - most convenient way to do it is with (?!) because an empty string - always matches, so an assertion that requires there not to be an empty - string must always fail. - - Lookbehind assertions - - Lookbehind assertions start with (?<= for positive assertions and (?<! - for negative assertions. For example, - - (?<!foo)bar - - does find an occurrence of "bar" that is not preceded by "foo". The - contents of a lookbehind assertion are restricted such that all the - strings it matches must have a fixed length. However, if there are sev- - eral top-level alternatives, they do not all have to have the same - fixed length. Thus - - (?<=bullock|donkey) - - is permitted, but - - (?<!dogs?|cats?) - - causes an error at compile time. Branches that match different length - strings are permitted only at the top level of a lookbehind assertion. - This is an extension compared with Perl (at least for 5.8), which - requires all branches to match the same length of string. An assertion - such as - - (?<=ab(c|de)) - - is not permitted, because its single top-level branch can match two - different lengths, but it is acceptable if rewritten to use two top- - level branches: - - (?<=abc|abde) - - In some cases, the Perl 5.10 escape sequence \K (see above) can be used - instead of a lookbehind assertion; this is not restricted to a fixed- - length. 
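The difference between these lookahead and lookbehind patterns is easy to confirm with a sketch. Python's re module is used as a stand-in for PCRE (an assumption; note that Python is stricter than PCRE about lookbehind, requiring every alternative to have the same length, so only single-branch lookbehinds appear here). The subject strings are made up for illustration.

```python
import re

# Positive lookahead: the semicolon is required but is not part of the match.
word = re.search(r"\w+(?=;)", "count = total;").group(0)

# Negative lookahead: only the "foo" that is not followed by "bar" matches.
foo_positions = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo)bar still finds "bar" after "foo": at offset 3 the next three
# characters are "bar", so the assertion (?!foo) is true there.
lookahead_hit = re.search(r"(?!foo)bar", "foobar").start()

# A negative lookbehind is what actually excludes a preceding "foo".
lookbehind_miss = re.search(r"(?<!foo)bar", "foobar")
lookbehind_hit = re.search(r"(?<!foo)bar", "bazbar").start()

print(word, foo_positions, lookahead_hit, lookbehind_miss, lookbehind_hit)
```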
- - The implementation of lookbehind assertions is, for each alternative, - to temporarily move the current position back by the fixed length and - then try to match. If there are insufficient characters before the cur- - rent position, the assertion fails. - - PCRE does not allow the \C escape (which matches a single byte in UTF-8 - mode) to appear in lookbehind assertions, because it makes it impossi- - ble to calculate the length of the lookbehind. The \X and \R escapes, - which can match different numbers of bytes, are also not permitted. - - Possessive quantifiers can be used in conjunction with lookbehind - assertions to specify efficient matching at the end of the subject - string. Consider a simple pattern such as - - abcd$ - - when applied to a long string that does not match. Because matching - proceeds from left to right, PCRE will look for each "a" in the subject - and then see if what follows matches the rest of the pattern. If the - pattern is specified as - - ^.*abcd$ - - the initial .* matches the entire string at first, but when this fails - (because there is no following "a"), it backtracks to match all but the - last character, then all but the last two characters, and so on. Once - again the search for "a" covers the entire string, from right to left, - so we are no better off. However, if the pattern is written as - - ^.*+(?<=abcd) - - there can be no backtracking for the .*+ item; it can match only the - entire string. The subsequent lookbehind assertion does a single test - on the last four characters. If it fails, the match fails immediately. - For long strings, this approach makes a significant difference to the - processing time. - - Using multiple assertions - - Several assertions (of any sort) may occur in succession. For example, - - (?<=\d{3})(?<!999)foo - - matches "foo" preceded by three digits that are not "999". Notice that - each of the assertions is applied independently at the same point in - the subject string. 
First there is a check that the previous three - characters are all digits, and then there is a check that the same - three characters are not "999". This pattern does not match "foo" pre- - ceded by six characters, the first three of which are digits and the last - three of which are not "999". For example, it doesn't match "123abc- - foo". A pattern to do that is - - (?<=\d{3}...)(?<!999)foo - - This time the first assertion looks at the preceding six characters, - checking that the first three are digits, and then the second assertion - checks that the preceding three characters are not "999". - - Assertions can be nested in any combination. For example, - - (?<=(?<!foo)bar)baz - - matches an occurrence of "baz" that is preceded by "bar" which in turn - is not preceded by "foo", while - - (?<=\d{3}(?!999)...)foo - - is another pattern that matches "foo" preceded by three digits and any - three characters that are not "999". - - -CONDITIONAL SUBPATTERNS - - It is possible to cause the matching process to obey a subpattern con- - ditionally or to choose between two alternative subpatterns, depending - on the result of an assertion, or whether a previous capturing subpat- - tern matched or not. The two possible forms of conditional subpattern - are - - (?(condition)yes-pattern) - (?(condition)yes-pattern|no-pattern) - - If the condition is satisfied, the yes-pattern is used; otherwise the - no-pattern (if present) is used. If there are more than two alterna- - tives in the subpattern, a compile-time error occurs. - - There are four kinds of condition: references to subpatterns, refer- - ences to recursion, a pseudo-condition called DEFINE, and assertions. - - Checking for a used subpattern by number - - If the text between the parentheses consists of a sequence of digits, - the condition is true if the capturing subpattern of that number has - previously matched. An alternative notation is to precede the digits - with a plus or minus sign.
In this case, the subpattern number is rela- - tive rather than absolute. The most recently opened parentheses can be - referenced by (?(-1), the next most recent by (?(-2), and so on. In - looping constructs it can also make sense to refer to subsequent groups - with constructs such as (?(+2). - - Consider the following pattern, which contains non-significant white - space to make it more readable (assume the PCRE_EXTENDED option) and to - divide it into three parts for ease of discussion: - - ( \( )? [^()]+ (?(1) \) ) - - The first part matches an optional opening parenthesis, and if that - character is present, sets it as the first captured substring. The sec- - ond part matches one or more characters that are not parentheses. The - third part is a conditional subpattern that tests whether the first set - of parentheses matched or not. If they did, that is, if subject started - with an opening parenthesis, the condition is true, and so the yes-pat- - tern is executed and a closing parenthesis is required. Otherwise, - since no-pattern is not present, the subpattern matches nothing. In - other words, this pattern matches a sequence of non-parentheses, - optionally enclosed in parentheses. - - If you were embedding this pattern in a larger one, you could use a - relative reference: - - ...other stuff... ( \( )? [^()]+ (?(-1) \) ) ... - - This makes the fragment independent of the parentheses in the larger - pattern. - - Checking for a used subpattern by name - - Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a - used subpattern by name. For compatibility with earlier versions of - PCRE, which had this facility before Perl, the syntax (?(name)...) is - also recognized. However, there is a possible ambiguity with this syn- - tax, because subpattern names may consist entirely of digits. 
PCRE - looks first for a named subpattern; if it cannot find one and the name - consists entirely of digits, PCRE looks for a subpattern of that num- - ber, which must be greater than zero. Using subpattern names that con- - sist entirely of digits is not recommended. - - Rewriting the above example to use a named subpattern gives this: - - (?<OPEN> \( )? [^()]+ (?(<OPEN>) \) ) - - - Checking for pattern recursion - - If the condition is the string (R), and there is no subpattern with the - name R, the condition is true if a recursive call to the whole pattern - or any subpattern has been made. If digits or a name preceded by amper- - sand follow the letter R, for example: - - (?(R3)...) or (?(R&name)...) - - the condition is true if the most recent recursion is into the subpat- - tern whose number or name is given. This condition does not check the - entire recursion stack. - - At "top level", all these recursion test conditions are false. Recur- - sive patterns are described below. - - Defining subpatterns for use by reference only - - If the condition is the string (DEFINE), and there is no subpattern - with the name DEFINE, the condition is always false. In this case, - there may be only one alternative in the subpattern. It is always - skipped if control reaches this point in the pattern; the idea of - DEFINE is that it can be used to define "subroutines" that can be ref- - erenced from elsewhere. (The use of "subroutines" is described below.) - For example, a pattern to match an IPv4 address could be written like - this (ignore whitespace and line breaks): - - (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) ) - \b (?&byte) (\.(?&byte)){3} \b - - The first part of the pattern is a DEFINE group inside which another - group named "byte" is defined. This matches an individual component of - an IPv4 address (a number less than 256). When matching takes place, - this part of the pattern is skipped because DEFINE acts like a false - condition.
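In an engine without DEFINE groups or (?&name) calls, the same reuse can be approximated by building the pattern from a shared string fragment. The following is a minimal sketch in Python, assuming the standard re module (which has no subroutine calls); the names BYTE, IPV4 and find_ipv4 are illustrative only and do not come from PCRE.

```python
import re

# One component of an IPv4 address (0-255), using the same alternatives
# as the "byte" group in the DEFINE example.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# Python's re has no (?&byte) call, so the fragment is substituted
# textually: one byte, then three more, each preceded by a dot.
IPV4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b")


def find_ipv4(text):
    """Return the first IPv4-looking substring in text, or None."""
    m = IPV4.search(text)
    return m.group(0) if m else None
```

Textual substitution repeats the fragment four times in the compiled pattern, whereas a DEFINE group keeps a single definition that is called by reference.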
- - The rest of the pattern uses references to the named group to match the - four dot-separated components of an IPv4 address, insisting on a word - boundary at each end. - - Assertion conditions - - If the condition is not in any of the above formats, it must be an - assertion. This may be a positive or negative lookahead or lookbehind - assertion. Consider this pattern, again containing non-significant - white space, and with the two alternatives on the second line: - - (?(?=[^a-z]*[a-z]) - \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} ) - - The condition is a positive lookahead assertion that matches an - optional sequence of non-letters followed by a letter. In other words, - it tests for the presence of at least one letter in the subject. If a - letter is found, the subject is matched against the first alternative; - otherwise it is matched against the second. This pattern matches - strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are - letters and dd are digits. - - -COMMENTS - - The sequence (?# marks the start of a comment that continues up to the - next closing parenthesis. Nested parentheses are not permitted. The - characters that make up a comment play no part in the pattern matching - at all. - - If the PCRE_EXTENDED option is set, an unescaped # character outside a - character class introduces a comment that continues to immediately - after the next newline in the pattern. - - -RECURSIVE PATTERNS - - Consider the problem of matching a string in parentheses, allowing for - unlimited nested parentheses. Without the use of recursion, the best - that can be done is to use a pattern that matches up to some fixed - depth of nesting. It is not possible to handle an arbitrary nesting - depth. - - For some time, Perl has provided a facility that allows regular expres- - sions to recurse (amongst other things). It does this by interpolating - Perl code in the expression at run time, and the code can refer to the - expression itself. 
A Perl pattern using code interpolation to solve the - parentheses problem can be created like this: - - $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x; - - The (?p{...}) item interpolates Perl code at run time, and in this case - refers recursively to the pattern in which it appears. - - Obviously, PCRE cannot support the interpolation of Perl code. Instead, - it supports special syntax for recursion of the entire pattern, and - also for individual subpattern recursion. After its introduction in - PCRE and Python, this kind of recursion was introduced into Perl at - release 5.10. - - A special item that consists of (? followed by a number greater than - zero and a closing parenthesis is a recursive call of the subpattern of - the given number, provided that it occurs inside that subpattern. (If - not, it is a "subroutine" call, which is described in the next sec- - tion.) The special item (?R) or (?0) is a recursive call of the entire - regular expression. - - In PCRE (like Python, but unlike Perl), a recursive subpattern call is - always treated as an atomic group. That is, once it has matched some of - the subject string, it is never re-entered, even if it contains untried - alternatives and there is a subsequent matching failure. - - This PCRE pattern solves the nested parentheses problem (assume the - PCRE_EXTENDED option is set so that white space is ignored): - - \( ( (?>[^()]+) | (?R) )* \) - - First it matches an opening parenthesis. Then it matches any number of - substrings which can either be a sequence of non-parentheses, or a - recursive match of the pattern itself (that is, a correctly parenthe- - sized substring). Finally there is a closing parenthesis. - - If this were part of a larger pattern, you would not want to recurse - the entire pattern, so instead you could use this: - - ( \( ( (?>[^()]+) | (?1) )* \) ) - - We have put the pattern into parentheses, and caused the recursion to - refer to them instead of the whole pattern. 
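Python's standard re module has no (?R) or (?1) item, so the pattern cannot be used there as written. As a rough illustration of what the recursion does, the same grammar can be expressed as a hand-written recursive function. This is a substitute technique, not how PCRE implements recursion, and the name match_parens is hypothetical.

```python
def match_parens(s, pos=0):
    # Hand-written mirror of the pattern  \( ( [^()]+ | (?R) )* \)
    # Returns the index just past the balanced group that starts at pos,
    # or None if no balanced group starts there.
    if pos >= len(s) or s[pos] != "(":
        return None
    pos += 1  # consume the opening parenthesis
    while pos < len(s) and s[pos] != ")":
        if s[pos] == "(":
            # The (?R) branch: recurse for a nested group.
            pos = match_parens(s, pos)
            if pos is None:
                return None
        else:
            # The [^()]+ branch: consume a run of non-parentheses.
            while pos < len(s) and s[pos] not in "()":
                pos += 1
    if pos < len(s):
        return pos + 1  # consume the closing parenthesis
    return None  # ran out of input before the group was closed
```

For instance, match_parens("(ab(cd)ef)") walks the outer group, recurses once for "(cd)", and returns the index just past the final parenthesis; an unbalanced string such as "(a(b)" yields None.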
- - In a larger pattern, keeping track of parenthesis numbers can be - tricky. This is made easier by the use of relative references. (A Perl - 5.10 feature.) Instead of (?1) in the pattern above you can write - (?-2) to refer to the second most recently opened parentheses preceding - the recursion. In other words, a negative number counts capturing - parentheses leftwards from the point at which it is encountered. - - It is also possible to refer to subsequently opened parentheses, by - writing references such as (?+2). However, these cannot be recursive - because the reference is not inside the parentheses that are refer- - enced. They are always "subroutine" calls, as described in the next - section. - - An alternative approach is to use named parentheses instead. The Perl - syntax for this is (?&name); PCRE's earlier syntax (?P>name) is also - supported. We could rewrite the above example as follows: - - (?<pn> \( ( (?>[^()]+) | (?&pn) )* \) ) - - If there is more than one subpattern with the same name, the earliest - one is used. - - This particular example pattern that we have been looking at contains - nested unlimited repeats, and so the use of atomic grouping for match- - ing strings of non-parentheses is important when applying the pattern - to strings that do not match. For example, when this pattern is applied - to - - (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() - - it yields "no match" quickly. However, if atomic grouping is not used, - the match runs for a very long time indeed because there are so many - different ways the + and * repeats can carve up the subject, and all - have to be tested before failure can be reported. - - At the end of a match, the values set for any capturing subpatterns are - those from the outermost level of the recursion at which the subpattern - value is set. If you want to obtain intermediate values, a callout - function can be used (see below and the pcrecallout documentation). 
If - the pattern above is matched against - - (ab(cd)ef) - - the value for the capturing parentheses is "ef", which is the last - value taken on at the top level. If additional parentheses are added, - giving - - \( ( ( (?>[^()]+) | (?R) )* ) \) - ^ ^ - ^ ^ - - the string they capture is "ab(cd)ef", the contents of the top level - parentheses. If there are more than 15 capturing parentheses in a pat- - tern, PCRE has to obtain extra memory to store data during a recursion, - which it does by using pcre_malloc, freeing it via pcre_free after- - wards. If no memory can be obtained, the match fails with the - PCRE_ERROR_NOMEMORY error. - - Do not confuse the (?R) item with the condition (R), which tests for - recursion. Consider this pattern, which matches text in angle brack- - ets, allowing for arbitrary nesting. Only digits are allowed in nested - brackets (that is, when recursing), whereas any characters are permit- - ted at the outer level. - - < (?: (?(R) \d++ | [^<>]*+) | (?R)) * > - - In this pattern, (?(R) is the start of a conditional subpattern, with - two different alternatives for the recursive and non-recursive cases. - The (?R) item is the actual recursive call. - - -SUBPATTERNS AS SUBROUTINES - - If the syntax for a recursive subpattern reference (either by number or - by name) is used outside the parentheses to which it refers, it oper- - ates like a subroutine in a programming language. The "called" subpat- - tern may be defined before or after the reference. A numbered reference - can be absolute or relative, as in these examples: - - (...(absolute)...)...(?2)... - (...(relative)...)...(?-1)... - (...(?+1)...(relative)... - - An earlier example pointed out that the pattern - - (sens|respons)e and \1ibility - - matches "sense and sensibility" and "response and responsibility", but - not "sense and responsibility". 
If instead the pattern - - (sens|respons)e and (?1)ibility - - is used, it does match "sense and responsibility" as well as the other - two strings. Another example is given in the discussion of DEFINE - above. - - Like recursive subpatterns, a "subroutine" call is always treated as an - atomic group. That is, once it has matched some of the subject string, - it is never re-entered, even if it contains untried alternatives and - there is a subsequent matching failure. - - When a subpattern is used as a subroutine, processing options such as - case-independence are fixed when the subpattern is defined. They cannot - be changed for different calls. For example, consider this pattern: - - (abc)(?i:(?-1)) - - It matches "abcabc". It does not match "abcABC" because the change of - processing option does not affect the called subpattern. - - -CALLOUTS - - Perl has a feature whereby using the sequence (?{...}) causes arbitrary - Perl code to be obeyed in the middle of matching a regular expression. - This makes it possible, amongst other things, to extract different sub- - strings that match the same pair of parentheses when there is a repeti- - tion. - - PCRE provides a similar feature, but of course it cannot obey arbitrary - Perl code. The feature is called "callout". The caller of PCRE provides - an external function by putting its entry point in the global variable - pcre_callout. By default, this variable contains NULL, which disables - all calling out. - - Within a regular expression, (?C) indicates the points at which the - external function is to be called. If you want to identify different - callout points, you can put a number less than 256 after the letter C. - The default value is zero. For example, this pattern has two callout - points: - - (?C1)abc(?C2)def - - If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are - automatically installed before each item in the pattern. They are all - numbered 255. 
- - During matching, when PCRE reaches a callout point (and pcre_callout is - set), the external function is called. It is provided with the number - of the callout, the position in the pattern, and, optionally, one item - of data originally supplied by the caller of pcre_exec(). The callout - function may cause matching to proceed, to backtrack, or to fail alto- - gether. A complete description of the interface to the callout function - is given in the pcrecallout documentation. - - -SEE ALSO - - pcreapi(3), pcrecallout(3), pcrematching(3), pcre(3). - - -AUTHOR - - Philip Hazel - University Computing Service - Cambridge CB2 3QH, England. - - -REVISION - - Last updated: 19 June 2007 - Copyright (c) 1997-2007 University of Cambridge. diff --git a/doc/doc-txt/pcretest.txt b/doc/doc-txt/pcretest.txt deleted file mode 100644 index d93ec26d1..000000000 --- a/doc/doc-txt/pcretest.txt +++ /dev/null @@ -1,630 +0,0 @@ -This file contains the PCRE man page that described the pcretest program. Note -that not all of the features of PCRE are available in the limited version that -is built with Exim. -------------------------------------------------------------------------------- - -PCRETEST(1) PCRETEST(1) - - -NAME - pcretest - a program for testing Perl-compatible regular expressions. - - -SYNOPSIS - - pcretest [options] [source] [destination] - - pcretest was written as a test program for the PCRE regular expression - library itself, but it can also be used for experimenting with regular - expressions. This document describes the features of the test program; - for details of the regular expressions themselves, see the pcrepattern - documentation. For details of the PCRE library function calls and their - options, see the pcreapi documentation. - - -OPTIONS - - -b Behave as if each regex has the /B (show bytecode) modifier; - the internal form is output after compilation. 
- - -C Output the version number of the PCRE library, and all avail- - able information about the optional features that are - included, and then exit. - - -d Behave as if each regex has the /D (debug) modifier; the - internal form and information about the compiled pattern is - output after compilation; -d is equivalent to -b -i. - - -dfa Behave as if each data line contains the \D escape sequence; - this causes the alternative matching function, - pcre_dfa_exec(), to be used instead of the standard - pcre_exec() function (more detail is given below). - - -help Output a brief summary of these options and then exit. - - -i Behave as if each regex has the /I modifier; information - about the compiled pattern is given after compilation. - - -m Output the size of each compiled pattern after it has been - compiled. This is equivalent to adding /M to each regular - expression. For compatibility with earlier versions of - pcretest, -s is a synonym for -m. - - -o osize Set the number of elements in the output vector that is used - when calling pcre_exec() or pcre_dfa_exec() to be osize. The - default value is 45, which is enough for 14 capturing subex- - pressions for pcre_exec() or 22 different matches for - pcre_dfa_exec(). The vector size can be changed for individ- - ual matching calls by including \O in the data line (see - below). - - -p Behave as if each regex has the /P modifier; the POSIX wrap- - per API is used to call PCRE. None of the other options has - any effect when -p is set. - - -q Do not output the version number of pcretest at the start of - execution. - - -S size On Unix-like systems, set the size of the runtime stack to - size megabytes. - - -t Run each compile, study, and match many times with a timer, - and output resulting time per compile or match (in millisec- - onds). Do not set -m with -t, because you will then get the - size output a zillion times, and the timing will be dis- - torted.
You can control the number of iterations that are - used for timing by following -t with a number (as a separate - item on the command line). For example, "-t 1000" would iter- - ate 1000 times. The default is to iterate 500000 times. - - -tm This is like -t except that it times only the matching phase, - not the compile or study phases. - - -DESCRIPTION - - If pcretest is given two filename arguments, it reads from the first - and writes to the second. If it is given only one filename argument, it - reads from that file and writes to stdout. Otherwise, it reads from - stdin and writes to stdout, and prompts for each line of input, using - "re>" to prompt for regular expressions, and "data>" to prompt for data - lines. - - The program handles any number of sets of input on a single input file. - Each set starts with a regular expression, and continues with any num- - ber of data lines to be matched against the pattern. - - Each data line is matched separately and independently. If you want to - do multi-line matches, you have to use the \n escape sequence (or \r or - \r\n, etc., depending on the newline setting) in a single line of input - to encode the newline sequences. There is no limit on the length of - data lines; the input buffer is automatically extended if it is too - small. - - An empty line signals the end of the data lines, at which point a new - regular expression is read. The regular expressions are given enclosed - in any non-alphanumeric delimiters other than backslash, for example: - - /(a|bc)x+yz/ - - White space before the initial delimiter is ignored. A regular expres- - sion may be continued over several input lines, in which case the new- - line characters are included within it. 
It is possible to include the - delimiter within the pattern by escaping it, for example - - /abc\/def/ - - If you do so, the escape and the delimiter form part of the pattern, - but since delimiters are always non-alphanumeric, this does not affect - its interpretation. If the terminating delimiter is immediately fol- - lowed by a backslash, for example, - - /abc/\ - - then a backslash is added to the end of the pattern. This is done to - provide a way of testing the error condition that arises if a pattern - finishes with a backslash, because - - /abc\/ - - is interpreted as the first line of a pattern that starts with "abc/", - causing pcretest to read the next line as a continuation of the regular - expression. - - -PATTERN MODIFIERS - - A pattern may be followed by any number of modifiers, which are mostly - single characters. Following Perl usage, these are referred to below - as, for example, "the /i modifier", even though the delimiter of the - pattern need not always be a slash, and no slash is used when writing - modifiers. Whitespace may appear between the final pattern delimiter - and the first modifier, and between the modifiers themselves. - - The /i, /m, /s, and /x modifiers set the PCRE_CASELESS, PCRE_MULTILINE, - PCRE_DOTALL, or PCRE_EXTENDED options, respectively, when pcre_com- - pile() is called. These four modifier letters have the same effect as - they do in Perl. For example: - - /caseless/i - - The following table shows additional modifiers for setting PCRE options - that do not correspond to anything in Perl: - - /A PCRE_ANCHORED - /C PCRE_AUTO_CALLOUT - /E PCRE_DOLLAR_ENDONLY - /f PCRE_FIRSTLINE - /J PCRE_DUPNAMES - /N PCRE_NO_AUTO_CAPTURE - /U PCRE_UNGREEDY - /X PCRE_EXTRA - /<cr> PCRE_NEWLINE_CR - /<lf> PCRE_NEWLINE_LF - /<crlf> PCRE_NEWLINE_CRLF - /<anycrlf> PCRE_NEWLINE_ANYCRLF - /<any> PCRE_NEWLINE_ANY - - Those specifying line ending sequences are literal strings as shown.
- This example sets multiline matching with CRLF as the line ending - sequence: - - /^abc/m<crlf> - - Details of the meanings of these PCRE options are given in the pcreapi - documentation. - - Finding all matches in a string - - Searching for all possible matches within each subject string can be - requested by the /g or /G modifier. After finding a match, PCRE is - called again to search the remainder of the subject string. The differ- - ence between /g and /G is that the former uses the startoffset argument - to pcre_exec() to start searching at a new point within the entire - string (which is in effect what Perl does), whereas the latter passes - over a shortened substring. This makes a difference to the matching - process if the pattern begins with a lookbehind assertion (including \b - or \B). - - If any call to pcre_exec() in a /g or /G sequence matches an empty - string, the next call is done with the PCRE_NOTEMPTY and PCRE_ANCHORED - flags set in order to search for another, non-empty, match at the same - point. If this second match fails, the start offset is advanced by - one, and the normal match is retried. This imitates the way Perl han- - dles such cases when using the /g modifier or the split() function. - - Other modifiers - - There are yet more modifiers for controlling the way pcretest operates. - - The /+ modifier requests that as well as outputting the substring that - matched the entire pattern, pcretest should in addition output the - remainder of the subject string. This is useful for tests where the - subject contains multiple copies of the same substring. - - The /B modifier is a debugging feature. It requests that pcretest out- - put a representation of the compiled byte code after compilation. Nor- - mally this information contains length and offset values; however, if - /Z is also present, this data is replaced by spaces. 
This is a special - feature for use in the automatic test scripts; it ensures that the same - output is generated for different internal link sizes. - - The /L modifier must be followed directly by the name of a locale, for - example, - - /pattern/Lfr_FR - - For this reason, it must be the last modifier. The given locale is set, - pcre_maketables() is called to build a set of character tables for the - locale, and this is then passed to pcre_compile() when compiling the - regular expression. Without an /L modifier, NULL is passed as the - tables pointer; that is, /L applies only to the expression on which it - appears. - - The /I modifier requests that pcretest output information about the - compiled pattern (whether it is anchored, has a fixed first character, - and so on). It does this by calling pcre_fullinfo() after compiling a - pattern. If the pattern is studied, the results of that are also out- - put. - - The /D modifier is a PCRE debugging feature, and is equivalent to /BI, - that is, both the /B and the /I modifiers. - - The /F modifier causes pcretest to flip the byte order of the fields in - the compiled pattern that contain 2-byte and 4-byte numbers. This - facility is for testing the feature in PCRE that allows it to execute - patterns that were compiled on a host with a different endianness. This - feature is not available when the POSIX interface to PCRE is being - used, that is, when the /P pattern modifier is specified. See also the - section about saving and reloading compiled patterns below. - - The /S modifier causes pcre_study() to be called after the expression - has been compiled, and the results used when the expression is matched. - - The /M modifier causes the size of memory block used to hold the com- - piled pattern to be output. - - The /P modifier causes pcretest to call PCRE via the POSIX wrapper API - rather than its native API. When this is done, all other modifiers - except /i, /m, and /+ are ignored. 
REG_ICASE is set if /i is present, - and REG_NEWLINE is set if /m is present. The wrapper functions force - PCRE_DOLLAR_ENDONLY always, and PCRE_DOTALL unless REG_NEWLINE is set. - - The /8 modifier causes pcretest to call PCRE with the PCRE_UTF8 option - set. This turns on support for UTF-8 character handling in PCRE, pro- - vided that it was compiled with this support enabled. This modifier - also causes any non-printing characters in output strings to be printed - using the \x{hh...} notation if they are valid UTF-8 sequences. - - If the /? modifier is used with /8, it causes pcretest to call - pcre_compile() with the PCRE_NO_UTF8_CHECK option, to suppress the - checking of the string for UTF-8 validity. - - -DATA LINES - - Before each data line is passed to pcre_exec(), leading and trailing - whitespace is removed, and it is then scanned for \ escapes. Some of - these are pretty esoteric features, intended for checking out some of - the more complicated features of PCRE. If you are just testing "ordi- - nary" regular expressions, you probably don't need any of these. 
The - following escapes are recognized: - - \a alarm (BEL, \x07) - \b backspace (\x08) - \e escape (\x1b) - \f formfeed (\x0c) - \n newline (\x0a) - \qdd set the PCRE_MATCH_LIMIT limit to dd - (any number of digits) - \r carriage return (\x0d) - \t tab (\x09) - \v vertical tab (\x0b) - \nnn octal character (up to 3 octal digits) - \xhh hexadecimal character (up to 2 hex digits) - \x{hh...} hexadecimal character, any number of digits - in UTF-8 mode - \A pass the PCRE_ANCHORED option to pcre_exec() - or pcre_dfa_exec() - \B pass the PCRE_NOTBOL option to pcre_exec() - or pcre_dfa_exec() - \Cdd call pcre_copy_substring() for substring dd - after a successful match (number less than 32) - \Cname call pcre_copy_named_substring() for substring - "name" after a successful match (name termin- - ated by next non-alphanumeric character) - \C+ show the current captured substrings at callout - time - \C- do not supply a callout function - \C!n return 1 instead of 0 when callout number n is - reached - \C!n!m return 1 instead of 0 when callout number n is - reached for the mth time - \C*n pass the number n (may be negative) as callout - data; this is used as the callout return value - \D use the pcre_dfa_exec() match function - \F only shortest match for pcre_dfa_exec() - \Gdd call pcre_get_substring() for substring dd - after a successful match (number less than 32) - \Gname call pcre_get_named_substring() for substring - "name" after a successful match (name termin- - ated by next non-alphanumeric character) - \L call pcre_get_substringlist() after a - successful match - \M discover the minimum MATCH_LIMIT and - MATCH_LIMIT_RECURSION settings - \N pass the PCRE_NOTEMPTY option to pcre_exec() - or pcre_dfa_exec() - \Odd set the size of the output vector passed to - pcre_exec() to dd (any number of digits) - \P pass the PCRE_PARTIAL option to pcre_exec() - or pcre_dfa_exec() - \Qdd set the PCRE_MATCH_LIMIT_RECURSION limit to dd - (any number of digits) - \R pass the
PCRE_DFA_RESTART option to pcre_dfa_exec() - \S output details of memory get/free calls during matching - \Z pass the PCRE_NOTEOL option to pcre_exec() - or pcre_dfa_exec() - \? pass the PCRE_NO_UTF8_CHECK option to - pcre_exec() or pcre_dfa_exec() - \>dd start the match at offset dd (any number of digits); - this sets the startoffset argument for pcre_exec() - or pcre_dfa_exec() - \<cr> pass the PCRE_NEWLINE_CR option to pcre_exec() - or pcre_dfa_exec() - \<lf> pass the PCRE_NEWLINE_LF option to pcre_exec() - or pcre_dfa_exec() - \<crlf> pass the PCRE_NEWLINE_CRLF option to pcre_exec() - or pcre_dfa_exec() - \<anycrlf> pass the PCRE_NEWLINE_ANYCRLF option to pcre_exec() - or pcre_dfa_exec() - \<any> pass the PCRE_NEWLINE_ANY option to pcre_exec() - or pcre_dfa_exec() - - The escapes that specify line ending sequences are literal strings, - exactly as shown. No more than one newline setting should be present in - any data line. - - A backslash followed by anything else just escapes the anything else. - If the very last character is a backslash, it is ignored. This gives a - way of passing an empty line as data, since a real empty line termi- - nates the data input. - - If \M is present, pcretest calls pcre_exec() several times, with dif- - ferent values in the match_limit and match_limit_recursion fields of - the pcre_extra data structure, until it finds the minimum numbers for - each parameter that allow pcre_exec() to complete. The match_limit num- - ber is a measure of the amount of backtracking that takes place, and - checking it out can be instructive. For most simple matches, the number - is quite small, but for patterns with very large numbers of matching - possibilities, it can become large very quickly with increasing length - of subject string. The match_limit_recursion number is a measure of how - much stack (or, if PCRE is compiled with NO_RECURSE, how much heap) - memory is needed to complete the match attempt. 
- - When \O is used, the value specified may be higher or lower than the - size set by the -o command line option (or defaulted to 45); \O applies - only to the call of pcre_exec() for the line in which it appears. - - If the /P modifier was present on the pattern, causing the POSIX wrap- - per API to be used, the only option-setting sequences that have any - effect are \B and \Z, causing REG_NOTBOL and REG_NOTEOL, respectively, - to be passed to regexec(). - - The use of \x{hh...} to represent UTF-8 characters is not dependent on - the use of the /8 modifier on the pattern. It is recognized always. - There may be any number of hexadecimal digits inside the braces. The - result is from one to six bytes, encoded according to the UTF-8 rules. - - -THE ALTERNATIVE MATCHING FUNCTION - - By default, pcretest uses the standard PCRE matching function, - pcre_exec() to match each data line. From release 6.0, PCRE supports an - alternative matching function, pcre_dfa_exec(), which operates in a - different way, and has some restrictions. The differences between the - two functions are described in the pcrematching documentation. - - If a data line contains the \D escape sequence, or if the command line - contains the -dfa option, the alternative matching function is called. - This function finds all possible matches at a given point. If, however, - the \F escape sequence is present in the data line, it stops after the - first match is found. This is always the shortest possible match. - - -DEFAULT OUTPUT FROM PCRETEST - - This section describes the output when the normal matching function, - pcre_exec(), is being used. - - When a match succeeds, pcretest outputs the list of captured substrings - that pcre_exec() returns, starting with number 0 for the string that - matched the whole pattern. Otherwise, it outputs "No match" or "Partial - match" when pcre_exec() returns PCRE_ERROR_NOMATCH or PCRE_ERROR_PAR- - TIAL, respectively, and otherwise the PCRE negative error number.
Here - is an example of an interactive pcretest run. - - $ pcretest - PCRE version 7.0 30-Nov-2006 - - re> /^abc(\d+)/ - data> abc123 - 0: abc123 - 1: 123 - data> xyz - No match - - If the strings contain any non-printing characters, they are output as - \xhh escapes, or as \x{...} escapes if the /8 modifier was present on - the pattern. See below for the definition of non-printing characters. - If the pattern has the /+ modifier, the output for substring 0 is fol- - lowed by the rest of the subject string, identified by "0+" like - this: - - re> /cat/+ - data> cataract - 0: cat - 0+ aract - - If the pattern has the /g or /G modifier, the results of successive - matching attempts are output in sequence, like this: - - re> /\Bi(\w\w)/g - data> Mississippi - 0: iss - 1: ss - 0: iss - 1: ss - 0: ipp - 1: pp - - "No match" is output only if the first match attempt fails. - - If any of the sequences \C, \G, or \L are present in a data line that - is successfully matched, the substrings extracted by the convenience - functions are output with C, G, or L after the string number instead of - a colon. This is in addition to the normal full list. The string length - (that is, the return from the extraction function) is given in paren- - theses after each string for \C and \G. - - Note that whereas patterns can be continued over several lines (a plain - ">" prompt is used for continuations), data lines may not. However new- - lines can be included in data by means of the \n escape (or \r, \r\n, - etc., depending on the newline sequence setting). - - -OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION - - When the alternative matching function, pcre_dfa_exec(), is used (by - means of the \D escape sequence or the -dfa command line option), the - output consists of a list of all the matches that start at the first - point in the subject where there is at least one match.
For example: - - re> /(tang|tangerine|tan)/ - data> yellow tangerine\D - 0: tangerine - 1: tang - 2: tan - - (Using the normal matching function on this data finds only "tang".) - The longest matching string is always given first (and numbered zero). - - If /g is present on the pattern, the search for further matches resumes - at the end of the longest match. For example: - - re> /(tang|tangerine|tan)/g - data> yellow tangerine and tangy sultana\D - 0: tangerine - 1: tang - 2: tan - 0: tang - 1: tan - 0: tan - - Since the matching function does not support substring capture, the - escape sequences that are concerned with captured substrings are not - relevant. - - -RESTARTING AFTER A PARTIAL MATCH - - When the alternative matching function has given the PCRE_ERROR_PARTIAL - return, indicating that the subject partially matched the pattern, you - can restart the match with additional subject data by means of the \R - escape sequence. For example: - - re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/ - data> 23ja\P\D - Partial match: 23ja - data> n05\R\D - 0: n05 - - For further information about partial matching, see the pcrepartial - documentation. - - -CALLOUTS - - If the pattern contains any callout requests, pcretest's callout func- - tion is called during matching. This works with both matching func- - tions. By default, the called function displays the callout number, the - start and current positions in the text at the callout time, and the - next pattern item to be tested. For example, the output - - --->pqrabcdef - 0 ^ ^ \d - - indicates that callout number 0 occurred for a match attempt starting - at the fourth character of the subject string, when the pointer was at - the seventh character of the data, and when the next pattern item was - \d. Just one circumflex is output if the start and current positions - are the same. - - Callouts numbered 255 are assumed to be automatic callouts, inserted as - a result of the /C pattern modifier. 
In this case, instead of showing - the callout number, the offset in the pattern, preceded by a plus, is - output. For example: - - re> /\d?[A-E]\*/C - data> E* - --->E* - +0 ^ \d? - +3 ^ [A-E] - +8 ^^ \* - +10 ^ ^ - 0: E* - - The callout function in pcretest returns zero (carry on matching) by - default, but you can use a \C item in a data line (as described above) - to change this. - - Inserting callouts can be helpful when using pcretest to check compli- - cated regular expressions. For further information about callouts, see - the pcrecallout documentation. - - -NON-PRINTING CHARACTERS - - When pcretest is outputting text in the compiled version of a pattern, - bytes other than 32-126 are always treated as non-printing characters - and are therefore shown as hex escapes. - - When pcretest is outputting text that is a matched part of a subject - string, it behaves in the same way, unless a different locale has been - set for the pattern (using the /L modifier). In this case, the - isprint() function is used to distinguish printing and non-printing characters. - - -SAVING AND RELOADING COMPILED PATTERNS - - The facilities described in this section are not available when the - POSIX interface to PCRE is being used, that is, when the /P pattern mod- - ifier is specified. - - When the POSIX interface is not in use, you can cause pcretest to write - a compiled pattern to a file, by following the modifiers with > and a - file name. For example: - - /pattern/im >/some/file - - See the pcreprecompile documentation for a discussion about saving and - re-using compiled patterns. - - The data that is written is binary. The first eight bytes are the - length of the compiled pattern data followed by the length of the - optional study data, each written as four bytes in big-endian order - (most significant byte first). If there is no study data (either the - pattern was not studied, or studying did not return any data), the sec- - ond length is zero.
The lengths are followed by an exact copy of the - compiled pattern. If there is additional study data, this follows imme- - diately after the compiled pattern. After writing the file, pcretest - expects to read a new pattern. - - A saved pattern can be reloaded into pcretest by specifying < and a file - name instead of a pattern. The name of the file must not contain a < - character, as otherwise pcretest will interpret the line as a pattern - delimited by < characters. For example: - - re> </some/file - Compiled regex loaded from /some/file - No study data - - When the pattern has been loaded, pcretest proceeds to read data lines - in the usual way. - - You can copy a file written by pcretest to a different host and reload - it there, even if the new host has opposite endianness to the one on - which the pattern was compiled. For example, you can compile on an i86 - machine and run on a SPARC machine. - - File names for saving and reloading can be absolute or relative, but - note that the shell facility of expanding a file name that starts with - a tilde (~) is not available. - - The ability to save and reload files in pcretest is intended for test- - ing and experimentation. It is not intended for production use because - only a single pattern can be written to a file. Furthermore, there is - no facility for supplying custom character tables for use with a - reloaded pattern. If the original pattern was compiled with custom - tables, an attempt to match a subject string using a reloaded pattern - is likely to cause pcretest to crash. Finally, if you attempt to load - a file that is not in the correct format, the result is undefined. - - -SEE ALSO - - pcre(3), pcreapi(3), pcrecallout(3), pcrematching(3), pcrepartial(3), - pcrepattern(3), pcreprecompile(3). - - -AUTHOR - - Philip Hazel - University Computing Service - Cambridge CB2 3QH, England. - - -REVISION - - Last updated: 24 April 2007 - Copyright (c) 1997-2007 University of Cambridge.
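Note on the saved-pattern layout described under SAVING AND RELOADING COMPILED PATTERNS in the deleted pcretest.txt: the following is a minimal Python sketch, not part of PCRE or Exim, of how the eight-byte header could be read. The function name parse_saved_pattern and the demo bytes are hypothetical; the only facts taken from the text are the two four-byte big-endian lengths followed by the compiled pattern and then any study data.

```python
import struct


def parse_saved_pattern(blob: bytes):
    """Split a pcretest-style saved file into (compiled, study) byte strings.

    Assumed layout, taken from the description in the text: a 4-byte
    compiled-pattern length, a 4-byte study-data length (both big-endian),
    then the compiled pattern, then the study data if there is any.
    """
    if len(blob) < 8:
        raise ValueError("file too short for the 8-byte length header")
    # Two 4-byte unsigned integers, most significant byte first.
    pattern_len, study_len = struct.unpack(">II", blob[:8])
    if len(blob) < 8 + pattern_len + study_len:
        raise ValueError("file is shorter than its header claims")
    compiled = blob[8:8 + pattern_len]
    study = blob[8 + pattern_len:8 + pattern_len + study_len]
    return compiled, study


# Synthetic stand-in for a saved file: 5 bytes of "pattern", no study data.
demo = struct.pack(">II", 5, 0) + b"\x01\x02\x03\x04\x05"
compiled, study = parse_saved_pattern(demo)
print(len(compiled), len(study))
```

The sketch only separates the two regions. The compiled pattern bytes themselves are opaque, and the text warns that loading a file in the wrong format gives undefined results, so the length checks here are a convenience rather than a validation of the contents.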