Installed PCRE 6.0 sources, which involved adding a number of files and
renaming some others.
author: Philip Hazel <ph10@hermes.cam.ac.uk> 2005-06-15 08:57:10 +0000
committer: Philip Hazel <ph10@hermes.cam.ac.uk> 2005-06-15 08:57:10 +0000
commit: 8ac170f35ed82789928f9e94beaa38991761a88c
tree: 6cb1d0e1bffabcf23a2cb5d8a529b8f2aa5c2e2f /doc/doc-txt
parent: ef213c3b49bcc37a2882f81a76755f114c2a81cb
3 files changed, 421 insertions, 317 deletions
diff --git a/doc/doc-txt/ChangeLog b/doc/doc-txt/ChangeLog
index bbdcb9843..5ff2e8869 100644
--- a/doc/doc-txt/ChangeLog
+++ b/doc/doc-txt/ChangeLog
@@ -1,4 +1,4 @@
-$Cambridge: exim/doc/doc-txt/ChangeLog,v 1.155 2005/06/14 13:48:40 ph10 Exp $
+$Cambridge: exim/doc/doc-txt/ChangeLog,v 1.156 2005/06/15 08:57:10 ph10 Exp $
 
 Change log file for Exim from version 4.21
 -------------------------------------------
@@ -116,6 +116,10 @@ PH/12 Applied Alex Kiernan's patch for the API change for the error callback
 
 PH/13 Changed auto_thaw such that it does not apply to bounce messages.
 
+PH/14 Imported PCRE 6.0; this was more than just a trivial operation because
+      the sources for PCRE have been re-arranged and more files are now
+      involved.
+
 
 Exim version 4.51
 -----------------
diff --git a/doc/doc-txt/pcrepattern.txt b/doc/doc-txt/pcrepattern.txt
index 1dc800af4..29fbe5652 100644
--- a/doc/doc-txt/pcrepattern.txt
+++ b/doc/doc-txt/pcrepattern.txt
@@ -1,18 +1,17 @@
-This file contains the PCRE man page that describes the regular expressions 
-supported by PCRE version 5.0. Note that not all of the features are relevant 
+This file contains the PCRE man page that describes the regular expressions
+supported by PCRE version 6.0. Note that not all of the features are relevant
 in the context of Exim. In particular, the version of PCRE that is compiled
 with Exim does not include UTF-8 support, there is no mechanism for changing
 the options with which the PCRE functions are called, and features such as
 callout are not accessible.
 -----------------------------------------------------------------------------
 
-PCRE(3)                                                                PCRE(3)
-
 
 
 NAME
        PCRE - Perl-compatible regular expressions
 
+
 PCRE REGULAR EXPRESSION DETAILS
 
        The  syntax  and semantics of the regular expressions supported by PCRE
@@ -30,6 +29,14 @@ PCRE REGULAR EXPRESSION DETAILS
        of  UTF-8  features  in  the  section on UTF-8 support in the main pcre
        page.
 
+       The remainder of this document discusses the  patterns  that  are  sup-
+       ported  by  PCRE when its main matching function, pcre_exec(), is used.
+       From  release  6.0,   PCRE   offers   a   second   matching   function,
+       pcre_dfa_exec(),  which matches using a different algorithm that is not
+       Perl-compatible. The advantages and disadvantages  of  the  alternative
+       function, and how it differs from the normal function, are discussed in
+       the pcrematching page.
+
        A regular expression is a pattern that is  matched  against  a  subject
        string  from  left  to right. Most characters stand for themselves in a
        pattern, and match the corresponding characters in the  subject.  As  a
@@ -37,15 +44,24 @@ PCRE REGULAR EXPRESSION DETAILS
 
          The quick brown fox
 
-       matches  a portion of a subject string that is identical to itself. The
-       power of regular expressions comes from the ability to include alterna-
-       tives  and repetitions in the pattern. These are encoded in the pattern
-       by the use of metacharacters, which do not  stand  for  themselves  but
-       instead are interpreted in some special way.
-
-       There  are  two different sets of metacharacters: those that are recog-
-       nized anywhere in the pattern except within square brackets, and  those
-       that  are  recognized  in square brackets. Outside square brackets, the
+       matches a portion of a subject string that is identical to itself. When
+       caseless matching is specified (the PCRE_CASELESS option), letters  are
+       matched  independently  of case. In UTF-8 mode, PCRE always understands
+       the concept of case for characters whose values are less than  128,  so
+       caseless  matching  is always possible. For characters with higher val-
+       ues, the concept of case is supported if PCRE is compiled with  Unicode
+       property  support,  but  not  otherwise.   If  you want to use caseless
+       matching for characters 128 and above, you must  ensure  that  PCRE  is
+       compiled with Unicode property support as well as with UTF-8 support.
+
+       The  power  of  regular  expressions  comes from the ability to include
+       alternatives and repetitions in the pattern. These are encoded  in  the
+       pattern by the use of metacharacters, which do not stand for themselves
+       but instead are interpreted in some special way.
+
+       There are two different sets of metacharacters: those that  are  recog-
+       nized  anywhere in the pattern except within square brackets, and those
+       that are recognized in square brackets. Outside  square  brackets,  the
        metacharacters are as follows:
 
          \      general escape character with several uses
@@ -64,7 +80,7 @@ PCRE REGULAR EXPRESSION DETAILS
                 also "possessive quantifier"
          {      start min/max quantifier
 
-       Part of a pattern that is in square brackets  is  called  a  "character
+       Part  of  a  pattern  that is in square brackets is called a "character
        class". In a character class the only metacharacters are:
 
          \      general escape character
@@ -74,33 +90,33 @@ PCRE REGULAR EXPRESSION DETAILS
                   syntax)
          ]      terminates the character class
 
-       The  following sections describe the use of each of the metacharacters.
+       The following sections describe the use of each of the  metacharacters.
 
 
 BACKSLASH
 
        The backslash character has several uses. Firstly, if it is followed by
-       a  non-alphanumeric  character,  it takes away any special meaning that
-       character may have. This  use  of  backslash  as  an  escape  character
+       a non-alphanumeric character, it takes away any  special  meaning  that
+       character  may  have.  This  use  of  backslash  as an escape character
        applies both inside and outside character classes.
 
-       For  example,  if  you want to match a * character, you write \* in the
-       pattern.  This escaping action applies whether  or  not  the  following
-       character  would  otherwise be interpreted as a metacharacter, so it is
-       always safe to precede a non-alphanumeric  with  backslash  to  specify
-       that  it stands for itself. In particular, if you want to match a back-
+       For example, if you want to match a * character, you write  \*  in  the
+       pattern.   This  escaping  action  applies whether or not the following
+       character would otherwise be interpreted as a metacharacter, so  it  is
+       always  safe  to  precede  a non-alphanumeric with backslash to specify
+       that it stands for itself. In particular, if you want to match a  back-
        slash, you write \\.
 
-       If a pattern is compiled with the PCRE_EXTENDED option,  whitespace  in
-       the  pattern (other than in a character class) and characters between a
+       If  a  pattern is compiled with the PCRE_EXTENDED option, whitespace in
+       the pattern (other than in a character class) and characters between  a
        # outside a character class and the next newline character are ignored.
-       An  escaping backslash can be used to include a whitespace or # charac-
+       An escaping backslash can be used to include a whitespace or #  charac-
        ter as part of the pattern.
 
-       If you want to remove the special meaning from a  sequence  of  charac-
-       ters,  you can do so by putting them between \Q and \E. This is differ-
-       ent from Perl in that $ and  @  are  handled  as  literals  in  \Q...\E
-       sequences  in  PCRE, whereas in Perl, $ and @ cause variable interpola-
+       If  you  want  to remove the special meaning from a sequence of charac-
+       ters, you can do so by putting them between \Q and \E. This is  differ-
+       ent  from  Perl  in  that  $  and  @ are handled as literals in \Q...\E
+       sequences in PCRE, whereas in Perl, $ and @ cause  variable  interpola-
        tion. Note the following examples:
 
          Pattern            PCRE matches   Perl matches
@@ -110,16 +126,16 @@ BACKSLASH
          \Qabc\$xyz\E       abc\$xyz       abc\$xyz
          \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
 
-       The \Q...\E sequence is recognized both inside  and  outside  character
+       The  \Q...\E  sequence  is recognized both inside and outside character
        classes.
 
    Non-printing characters
 
        A second use of backslash provides a way of encoding non-printing char-
-       acters in patterns in a visible manner. There is no restriction on  the
-       appearance  of non-printing characters, apart from the binary zero that
-       terminates a pattern, but when a pattern  is  being  prepared  by  text
-       editing,  it  is  usually  easier  to  use  one of the following escape
+       acters  in patterns in a visible manner. There is no restriction on the
+       appearance of non-printing characters, apart from the binary zero  that
+       terminates  a  pattern,  but  when  a pattern is being prepared by text
+       editing, it is usually easier  to  use  one  of  the  following  escape
        sequences than the binary character it represents:
 
          \a        alarm, that is, the BEL character (hex 07)
@@ -133,44 +149,44 @@ BACKSLASH
          \xhh      character with hex code hh
          \x{hhh..} character with hex code hhh... (UTF-8 mode only)
 
-       The precise effect of \cx is as follows: if x is a lower  case  letter,
-       it  is converted to upper case. Then bit 6 of the character (hex 40) is
-       inverted.  Thus \cz becomes hex 1A, but \c{ becomes hex 3B,  while  \c;
+       The  precise  effect of \cx is as follows: if x is a lower case letter,
+       it is converted to upper case. Then bit 6 of the character (hex 40)  is
+       inverted.   Thus  \cz becomes hex 1A, but \c{ becomes hex 3B, while \c;
        becomes hex 7B.
 
-       After  \x, from zero to two hexadecimal digits are read (letters can be
-       in upper or lower case). In UTF-8 mode, any number of hexadecimal  dig-
-       its  may  appear between \x{ and }, but the value of the character code
-       must be less than 2**31 (that is,  the  maximum  hexadecimal  value  is
-       7FFFFFFF).  If  characters other than hexadecimal digits appear between
-       \x{ and }, or if there is no terminating }, this form of escape is  not
-       recognized. Instead, the initial \x will be interpreted as a basic hex-
-       adecimal escape, with no following digits,  giving  a  character  whose
+       After \x, from zero to two hexadecimal digits are read (letters can  be
+       in  upper or lower case). In UTF-8 mode, any number of hexadecimal dig-
+       its may appear between \x{ and }, but the value of the  character  code
+       must  be  less  than  2**31  (that is, the maximum hexadecimal value is
+       7FFFFFFF). If characters other than hexadecimal digits  appear  between
+       \x{  and }, or if there is no terminating }, this form of escape is not
+       recognized. Instead, the initial \x will  be  interpreted  as  a  basic
+       hexadecimal  escape, with no following digits, giving a character whose
        value is zero.
 
        Characters whose value is less than 256 can be defined by either of the
-       two syntaxes for \x when PCRE is in UTF-8 mode. There is no  difference
-       in  the  way they are handled. For example, \xdc is exactly the same as
+       two  syntaxes for \x when PCRE is in UTF-8 mode. There is no difference
+       in the way they are handled. For example, \xdc is exactly the  same  as
        \x{dc}.
 
-       After \0 up to two further octal digits are read.  In  both  cases,  if
-       there  are fewer than two digits, just those that are present are used.
-       Thus the sequence \0\x\07 specifies two binary zeros followed by a  BEL
-       character  (code  value  7).  Make sure you supply two digits after the
-       initial zero if the pattern character that follows is itself  an  octal
+       After  \0  up  to  two further octal digits are read. In both cases, if
+       there are fewer than two digits, just those that are present are  used.
+       Thus  the sequence \0\x\07 specifies two binary zeros followed by a BEL
+       character (code value 7). Make sure you supply  two  digits  after  the
+       initial  zero  if the pattern character that follows is itself an octal
        digit.
 
        The handling of a backslash followed by a digit other than 0 is compli-
        cated.  Outside a character class, PCRE reads it and any following dig-
-       its  as  a  decimal  number. If the number is less than 10, or if there
+       its as a decimal number. If the number is less than  10,  or  if  there
        have been at least that many previous capturing left parentheses in the
-       expression,  the  entire  sequence  is  taken  as  a  back reference. A
-       description of how this works is given later, following the  discussion
+       expression, the entire  sequence  is  taken  as  a  back  reference.  A
+       description  of how this works is given later, following the discussion
        of parenthesized subpatterns.
 
-       Inside  a  character  class, or if the decimal number is greater than 9
-       and there have not been that many capturing subpatterns, PCRE  re-reads
-       up  to three octal digits following the backslash, and generates a sin-
+       Inside a character class, or if the decimal number is  greater  than  9
+       and  there have not been that many capturing subpatterns, PCRE re-reads
+       up to three octal digits following the backslash, and generates a  sin-
        gle byte from the least significant 8 bits of the value. Any subsequent
        digits stand for themselves.  For example:
 
@@ -189,19 +205,19 @@ BACKSLASH
          \81    is either a back reference, or a binary zero
                    followed by the two characters "8" and "1"
 
-       Note  that  octal  values of 100 or greater must not be introduced by a
+       Note that octal values of 100 or greater must not be  introduced  by  a
        leading zero, because no more than three octal digits are ever read.
 
-       All the sequences that define a single byte value  or  a  single  UTF-8
+       All  the  sequences  that  define a single byte value or a single UTF-8
        character (in UTF-8 mode) can be used both inside and outside character
-       classes. In addition, inside a character  class,  the  sequence  \b  is
+       classes.  In  addition,  inside  a  character class, the sequence \b is
        interpreted as the backspace character (hex 08), and the sequence \X is
-       interpreted as the character "X".  Outside  a  character  class,  these
+       interpreted  as  the  character  "X".  Outside a character class, these
        sequences have different meanings (see below).
 
    Generic character types
 
-       The  third  use of backslash is for specifying generic character types.
+       The third use of backslash is for specifying generic  character  types.
        The following are always recognized:
 
          \d     any decimal digit
@@ -212,48 +228,48 @@ BACKSLASH
          \W     any "non-word" character
 
        Each pair of escape sequences partitions the complete set of characters
-       into  two disjoint sets. Any given character matches one, and only one,
+       into two disjoint sets. Any given character matches one, and only  one,
        of each pair.
 
        These character type sequences can appear both inside and outside char-
-       acter  classes.  They each match one character of the appropriate type.
-       If the current matching point is at the end of the subject string,  all
+       acter classes. They each match one character of the  appropriate  type.
+       If  the current matching point is at the end of the subject string, all
        of them fail, since there is no character to match.
 
-       For  compatibility  with Perl, \s does not match the VT character (code
-       11).  This makes it different from the the POSIX "space" class. The  \s
+       For compatibility with Perl, \s does not match the VT  character  (code
+       11).   This makes it different from the the POSIX "space" class. The \s
        characters are HT (9), LF (10), FF (12), CR (13), and space (32).
 
        A "word" character is an underscore or any character less than 256 that
-       is a letter or digit. The definition of  letters  and  digits  is  con-
-       trolled  by PCRE's low-valued character tables, and may vary if locale-
-       specific matching is taking place (see "Locale support" in the  pcreapi
-       page).  For  example,  in  the  "fr_FR" (French) locale, some character
-       codes greater than 128 are used for accented  letters,  and  these  are
+       is  a  letter  or  digit.  The definition of letters and digits is con-
+       trolled by PCRE's low-valued character tables, and may vary if  locale-
+       specific  matching is taking place (see "Locale support" in the pcreapi
+       page). For example, in the  "fr_FR"  (French)  locale,  some  character
+       codes  greater  than  128  are used for accented letters, and these are
        matched by \w.
 
-       In  UTF-8 mode, characters with values greater than 128 never match \d,
+       In UTF-8 mode, characters with values greater than 128 never match  \d,
        \s, or \w, and always match \D, \S, and \W. This is true even when Uni-
        code character property support is available.
 
    Unicode character properties
 
        When PCRE is built with Unicode character property support, three addi-
-       tional escape sequences to match generic character types are  available
+       tional  escape sequences to match generic character types are available
        when UTF-8 mode is selected. They are:
 
         \p{xx}   a character with the xx property
         \P{xx}   a character without the xx property
         \X       an extended Unicode sequence
 
-       The  property  names represented by xx above are limited to the Unicode
-       general category properties. Each character has exactly one such  prop-
-       erty,  specified  by  a two-letter abbreviation. For compatibility with
-       Perl, negation can be specified by including a circumflex  between  the
-       opening  brace  and the property name. For example, \p{^Lu} is the same
+       The property names represented by xx above are limited to  the  Unicode
+       general  category properties. Each character has exactly one such prop-
+       erty, specified by a two-letter abbreviation.  For  compatibility  with
+       Perl,  negation  can be specified by including a circumflex between the
+       opening brace and the property name. For example, \p{^Lu} is  the  same
        as \P{Lu}.
 
-       If only one letter is specified with \p or  \P,  it  includes  all  the
+       If  only  one  letter  is  specified with \p or \P, it includes all the
        properties that start with that letter. In this case, in the absence of
        negation, the curly brackets in the escape sequence are optional; these
        two examples have the same effect:
@@ -307,33 +323,33 @@ BACKSLASH
          Zp    Paragraph separator
          Zs    Space separator
 
-       Extended  properties such as "Greek" or "InMusicalSymbols" are not sup-
+       Extended properties such as "Greek" or "InMusicalSymbols" are not  sup-
        ported by PCRE.
 
-       Specifying caseless matching does not affect  these  escape  sequences.
+       Specifying  caseless  matching  does not affect these escape sequences.
        For example, \p{Lu} always matches only upper case letters.
 
-       The  \X  escape  matches  any number of Unicode characters that form an
+       The \X escape matches any number of Unicode  characters  that  form  an
        extended Unicode sequence. \X is equivalent to
 
          (?>\PM\pM*)
 
-       That is, it matches a character without the "mark"  property,  followed
-       by  zero  or  more  characters with the "mark" property, and treats the
-       sequence as an atomic group (see below).  Characters  with  the  "mark"
+       That  is,  it matches a character without the "mark" property, followed
+       by zero or more characters with the "mark"  property,  and  treats  the
+       sequence  as  an  atomic group (see below).  Characters with the "mark"
        property are typically accents that affect the preceding character.
 
-       Matching  characters  by Unicode property is not fast, because PCRE has
-       to search a structure that contains  data  for  over  fifteen  thousand
+       Matching characters by Unicode property is not fast, because  PCRE  has
+       to  search  a  structure  that  contains data for over fifteen thousand
        characters. That is why the traditional escape sequences such as \d and
        \w do not use Unicode properties in PCRE.
 
    Simple assertions
 
        The fourth use of backslash is for certain simple assertions. An asser-
-       tion  specifies a condition that has to be met at a particular point in
-       a match, without consuming any characters from the subject string.  The
-       use  of subpatterns for more complicated assertions is described below.
+       tion specifies a condition that has to be met at a particular point  in
+       a  match, without consuming any characters from the subject string. The
+       use of subpatterns for more complicated assertions is described  below.
        The backslashed assertions are:
 
          \b     matches at a word boundary
@@ -343,42 +359,42 @@ BACKSLASH
          \z     matches at end of subject
          \G     matches at first matching position in subject
 
-       These assertions may not appear in character classes (but note that  \b
+       These  assertions may not appear in character classes (but note that \b
        has a different meaning, namely the backspace character, inside a char-
        acter class).
 
-       A word boundary is a position in the subject string where  the  current
-       character  and  the previous character do not both match \w or \W (i.e.
-       one matches \w and the other matches \W), or the start or  end  of  the
+       A  word  boundary is a position in the subject string where the current
+       character and the previous character do not both match \w or  \W  (i.e.
+       one  matches  \w  and the other matches \W), or the start or end of the
        string if the first or last character matches \w, respectively.
 
-       The  \A,  \Z,  and \z assertions differ from the traditional circumflex
+       The \A, \Z, and \z assertions differ from  the  traditional  circumflex
        and dollar (described in the next section) in that they only ever match
-       at  the  very start and end of the subject string, whatever options are
-       set. Thus, they are independent of multiline mode. These  three  asser-
+       at the very start and end of the subject string, whatever  options  are
+       set.  Thus,  they are independent of multiline mode. These three asser-
        tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which
-       affect only the behaviour of the circumflex and dollar  metacharacters.
-       However,  if the startoffset argument of pcre_exec() is non-zero, indi-
+       affect  only the behaviour of the circumflex and dollar metacharacters.
+       However, if the startoffset argument of pcre_exec() is non-zero,  indi-
        cating that matching is to start at a point other than the beginning of
-       the  subject,  \A  can never match. The difference between \Z and \z is
-       that \Z matches before a newline that is  the  last  character  of  the
-       string  as well as at the end of the string, whereas \z matches only at
+       the subject, \A can never match. The difference between \Z  and  \z  is
+       that  \Z  matches  before  a  newline that is the last character of the
+       string as well as at the end of the string, whereas \z matches only  at
        the end.
 
-       The \G assertion is true only when the current matching position is  at
-       the  start point of the match, as specified by the startoffset argument
-       of pcre_exec(). It differs from \A when the  value  of  startoffset  is
-       non-zero.  By calling pcre_exec() multiple times with appropriate argu-
+       The  \G assertion is true only when the current matching position is at
+       the start point of the match, as specified by the startoffset  argument
+       of  pcre_exec().  It  differs  from \A when the value of startoffset is
+       non-zero. By calling pcre_exec() multiple times with appropriate  argu-
        ments, you can mimic Perl's /g option, and it is in this kind of imple-
        mentation where \G can be useful.
 
-       Note,  however,  that  PCRE's interpretation of \G, as the start of the
+       Note, however, that PCRE's interpretation of \G, as the  start  of  the
        current match, is subtly different from Perl's, which defines it as the
-       end  of  the  previous  match. In Perl, these can be different when the
-       previously matched string was empty. Because PCRE does just  one  match
+       end of the previous match. In Perl, these can  be  different  when  the
+       previously  matched  string was empty. Because PCRE does just one match
        at a time, it cannot reproduce this behaviour.
 
-       If  all  the alternatives of a pattern begin with \G, the expression is
+       If all the alternatives of a pattern begin with \G, the  expression  is
        anchored to the starting match position, and the "anchored" flag is set
        in the compiled regular expression.
 
@@ -386,73 +402,73 @@ BACKSLASH
 CIRCUMFLEX AND DOLLAR
 
        Outside a character class, in the default matching mode, the circumflex
-       character is an assertion that is true only  if  the  current  matching
-       point  is  at the start of the subject string. If the startoffset argu-
-       ment of pcre_exec() is non-zero, circumflex  can  never  match  if  the
-       PCRE_MULTILINE  option  is  unset. Inside a character class, circumflex
+       character  is  an  assertion  that is true only if the current matching
+       point is at the start of the subject string. If the  startoffset  argu-
+       ment  of  pcre_exec()  is  non-zero,  circumflex can never match if the
+       PCRE_MULTILINE option is unset. Inside a  character  class,  circumflex
        has an entirely different meaning (see below).
 
-       Circumflex need not be the first character of the pattern if  a  number
-       of  alternatives are involved, but it should be the first thing in each
-       alternative in which it appears if the pattern is ever  to  match  that
-       branch.  If all possible alternatives start with a circumflex, that is,
-       if the pattern is constrained to match only at the start  of  the  sub-
-       ject,  it  is  said  to be an "anchored" pattern. (There are also other
+       Circumflex  need  not be the first character of the pattern if a number
+       of alternatives are involved, but it should be the first thing in  each
+       alternative  in  which  it appears if the pattern is ever to match that
+       branch. If all possible alternatives start with a circumflex, that  is,
+       if  the  pattern  is constrained to match only at the start of the sub-
+       ject, it is said to be an "anchored" pattern.  (There  are  also  other
        constructs that can cause a pattern to be anchored.)
 
-       A dollar character is an assertion that is true  only  if  the  current
-       matching  point  is  at  the  end of the subject string, or immediately
+       A  dollar  character  is  an assertion that is true only if the current
+       matching point is at the end of  the  subject  string,  or  immediately
        before a newline character that is the last character in the string (by
-       default).  Dollar  need  not  be the last character of the pattern if a
-       number of alternatives are involved, but it should be the last item  in
-       any  branch  in  which  it appears.  Dollar has no special meaning in a
+       default). Dollar need not be the last character of  the  pattern  if  a
+       number  of alternatives are involved, but it should be the last item in
+       any branch in which it appears.  Dollar has no  special  meaning  in  a
        character class.
 
-       The meaning of dollar can be changed so that it  matches  only  at  the
-       very  end  of  the string, by setting the PCRE_DOLLAR_ENDONLY option at
+       The  meaning  of  dollar  can be changed so that it matches only at the
+       very end of the string, by setting the  PCRE_DOLLAR_ENDONLY  option  at
        compile time. This does not affect the \Z assertion.
 
        The meanings of the circumflex and dollar characters are changed if the
        PCRE_MULTILINE option is set. When this is the case, they match immedi-
-       ately after and  immediately  before  an  internal  newline  character,
-       respectively,  in addition to matching at the start and end of the sub-
-       ject string. For example,  the  pattern  /^abc$/  matches  the  subject
-       string  "def\nabc"  (where \n represents a newline character) in multi-
+       ately  after  and  immediately  before  an  internal newline character,
+       respectively, in addition to matching at the start and end of the  sub-
+       ject  string.  For  example,  the  pattern  /^abc$/ matches the subject
+       string "def\nabc" (where \n represents a newline character)  in  multi-
        line mode, but not otherwise.  Consequently, patterns that are anchored
-       in  single line mode because all branches start with ^ are not anchored
-       in multiline mode, and a match for  circumflex  is  possible  when  the
-       startoffset   argument   of  pcre_exec()  is  non-zero.  The  PCRE_DOL-
+       in single line mode because all branches start with ^ are not  anchored
+       in  multiline  mode,  and  a  match for circumflex is possible when the
+       startoffset  argument  of  pcre_exec()  is  non-zero.   The   PCRE_DOL-
        LAR_ENDONLY option is ignored if PCRE_MULTILINE is set.
 
-       Note that the sequences \A, \Z, and \z can be used to match  the  start
-       and  end of the subject in both modes, and if all branches of a pattern
-       start with \A it is always anchored, whether PCRE_MULTILINE is  set  or
+       Note  that  the sequences \A, \Z, and \z can be used to match the start
+       and end of the subject in both modes, and if all branches of a  pattern
+       start  with  \A it is always anchored, whether PCRE_MULTILINE is set or
        not.
 
 
 FULL STOP (PERIOD, DOT)
 
        Outside a character class, a dot in the pattern matches any one charac-
-       ter in the subject, including a non-printing  character,  but  not  (by
-       default)  newline.   In  UTF-8 mode, a dot matches any UTF-8 character,
+       ter  in  the  subject,  including a non-printing character, but not (by
+       default) newline.  In UTF-8 mode, a dot matches  any  UTF-8  character,
        which might be more than one byte long, except (by default) newline. If
-       the  PCRE_DOTALL  option  is set, dots match newlines as well. The han-
-       dling of dot is entirely independent of the handling of circumflex  and
-       dollar,  the  only  relationship  being  that they both involve newline
+       the PCRE_DOTALL option is set, dots match newlines as  well.  The  han-
+       dling  of dot is entirely independent of the handling of circumflex and
+       dollar, the only relationship being  that  they  both  involve  newline
        characters. Dot has no special meaning in a character class.
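
As an illustration of the default behaviour of dot and of the effect of PCRE_DOTALL, here is a minimal sketch. It again assumes Python's re module as a stand-in for PCRE; re.DOTALL plays the role of PCRE_DOTALL.

```python
import re

# By default a dot does not match a newline.
default_dot = re.search(r"a.b", "a\nb")

# With the DOTALL flag set, a dot matches a newline as well.
dotall_dot = re.search(r"a.b", "a\nb", re.DOTALL)

# A non-printing character other than newline is matched in either case.
nonprinting = re.search(r"a.b", "a\x01b")

print(default_dot is None, dotall_dot is not None, nonprinting is not None)
```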
 
 
 MATCHING A SINGLE BYTE
 
        Outside a character class, the escape sequence \C matches any one byte,
-       both  in  and  out of UTF-8 mode. Unlike a dot, it can match a newline.
-       The feature is provided in Perl in order to match individual  bytes  in
-       UTF-8  mode.  Because  it  breaks  up  UTF-8 characters into individual
-       bytes, what remains in the string may be a malformed UTF-8 string.  For
+       both in and out of UTF-8 mode. Unlike a dot, it can  match  a  newline.
+       The  feature  is provided in Perl in order to match individual bytes in
+       UTF-8 mode. Because it  breaks  up  UTF-8  characters  into  individual
+       bytes,  what remains in the string may be a malformed UTF-8 string. For
        this reason, the \C escape sequence is best avoided.
 
-       PCRE  does  not  allow \C to appear in lookbehind assertions (described
-       below), because in UTF-8 mode this would make it impossible  to  calcu-
+       PCRE does not allow \C to appear in  lookbehind  assertions  (described
+       below),  because  in UTF-8 mode this would make it impossible to calcu-
        late the length of the lookbehind.
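
The reason why \C is best avoided can be shown with a small sketch. Python's re module has no \C escape, so the example assumes a bytes pattern in which a dot matches exactly one byte; this only imitates what \C does in UTF-8 mode.

```python
import re

# U+00E9 is encoded in UTF-8 as the two bytes 0xC3 0xA9.
subject = "\u00e9".encode("utf-8")

# Match a single byte, as \C would, splitting the character in two.
one_byte = re.match(rb".", subject, re.DOTALL)
remainder = subject[one_byte.end():]

# What is left over is a lone continuation byte, which is malformed UTF-8.
try:
    remainder.decode("utf-8")
    remainder_is_valid = True
except UnicodeDecodeError:
    remainder_is_valid = False

print(one_byte.group(0), remainder, remainder_is_valid)
```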
 
 
@@ -461,35 +477,40 @@ SQUARE BRACKETS AND CHARACTER CLASSES
        An opening square bracket introduces a character class, terminated by a
        closing square bracket. A closing square bracket on its own is not spe-
        cial. If a closing square bracket is required as a member of the class,
-       it should be the first data character in the class  (after  an  initial
+       it  should  be  the first data character in the class (after an initial
        circumflex, if present) or escaped with a backslash.
 
-       A  character  class matches a single character in the subject. In UTF-8
-       mode, the character may occupy more than one byte. A matched  character
+       A character class matches a single character in the subject.  In  UTF-8
+       mode,  the character may occupy more than one byte. A matched character
        must be in the set of characters defined by the class, unless the first
-       character in the class definition is a circumflex, in  which  case  the
-       subject  character  must  not  be in the set defined by the class. If a
-       circumflex is actually required as a member of the class, ensure it  is
+       character  in  the  class definition is a circumflex, in which case the
+       subject character must not be in the set defined by  the  class.  If  a
+       circumflex  is actually required as a member of the class, ensure it is
        not the first character, or escape it with a backslash.
 
-       For  example, the character class [aeiou] matches any lower case vowel,
-       while [^aeiou] matches any character that is not a  lower  case  vowel.
+       For example, the character class [aeiou] matches any lower case  vowel,
+       while  [^aeiou]  matches  any character that is not a lower case vowel.
        Note that a circumflex is just a convenient notation for specifying the
-       characters that are in the class by enumerating those that are  not.  A
-       class  that starts with a circumflex is not an assertion: it still con-
-       sumes a character from the subject string, and therefore  it  fails  if
+       characters  that  are in the class by enumerating those that are not. A
+       class that starts with a circumflex is not an assertion: it still  con-
+       sumes  a  character  from the subject string, and therefore it fails if
        the current pointer is at the end of the string.
 
-       In  UTF-8 mode, characters with values greater than 255 can be included
-       in a class as a literal string of bytes, or by using the  \x{  escaping
+       In UTF-8 mode, characters with values greater than 255 can be  included
+       in  a  class as a literal string of bytes, or by using the \x{ escaping
        mechanism.
 
-       When  caseless  matching  is set, any letters in a class represent both
-       their upper case and lower case versions, so for  example,  a  caseless
-       [aeiou]  matches  "A"  as well as "a", and a caseless [^aeiou] does not
-       match "A", whereas a caseful version would. When running in UTF-8 mode,
-       PCRE  supports  the  concept of case for characters with values greater
-       than 128 only when it is compiled with Unicode property support.
+       When caseless matching is set, any letters in a  class  represent  both
+       their  upper  case  and lower case versions, so for example, a caseless
+       [aeiou] matches "A" as well as "a", and a caseless  [^aeiou]  does  not
+       match  "A", whereas a caseful version would. In UTF-8 mode, PCRE always
+       understands the concept of case for characters whose  values  are  less
+       than  128, so caseless matching is always possible. For characters with
+       higher values, the concept of case is supported  if  PCRE  is  compiled
+       with  Unicode  property support, but not otherwise.  If you want to use
+       caseless matching for characters 128 and above, you  must  ensure  that
+       PCRE  is  compiled  with Unicode property support as well as with UTF-8
+       support.
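
The class behaviour described above can be checked with a short script. This is a sketch that assumes Python's re module in place of PCRE, with re.IGNORECASE standing in for caseless matching; only ASCII letters are used, so the remarks about characters 128 and above are not exercised.

```python
import re

# Caseful: "A" is not a lower case vowel, so the negated class matches it.
caseful_hit = re.fullmatch(r"[^aeiou]", "A")

# Caseless: the letters in the class stand for both cases.
caseless_hit = re.fullmatch(r"[aeiou]", "A", re.IGNORECASE)
caseless_miss = re.fullmatch(r"[^aeiou]", "A", re.IGNORECASE)

# A negated class is not an assertion: it still needs a character to
# consume, so it fails at the end of the subject.
at_end = re.search(r"x[^aeiou]", "x")

print(caseful_hit is not None, caseless_hit is not None)
print(caseless_miss is None, at_end is None)
```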
 
        The newline character is never treated in any special way in  character
        classes,  whatever  the  setting  of  the PCRE_DOTALL or PCRE_MULTILINE
@@ -1409,5 +1430,5 @@ CALLOUTS
        gether. A complete description of the interface to the callout function
        is given in the pcrecallout documentation.
 
-Last updated: 09 September 2004
-Copyright (c) 1997-2004 University of Cambridge.
+Last updated: 28 February 2005
+Copyright (c) 1997-2005 University of Cambridge.
diff --git a/doc/doc-txt/pcretest.txt b/doc/doc-txt/pcretest.txt
index 9e9b70ef4..384a6c38f 100644
--- a/doc/doc-txt/pcretest.txt
+++ b/doc/doc-txt/pcretest.txt
@@ -1,5 +1,5 @@
-This file contains the PCRE man page that described the pcretest program. Note 
-that not all of the features of PCRE are available in the limited version that 
+This file contains the PCRE man page that describes the pcretest program. Note
+that not all of the features of PCRE are available in the limited version that
 is built with Exim.
 -------------------------------------------------------------------------------
 
@@ -12,7 +12,7 @@ NAME
 
 SYNOPSIS
 
-       pcretest [-C] [-d] [-i] [-m] [-o osize] [-p] [-t] [source]
+       pcretest [-C] [-d] [-dfa] [-i] [-m] [-o osize] [-p] [-t] [source]
             [destination]
 
        pcretest  was written as a test program for the PCRE regular expression
@@ -29,95 +29,100 @@ OPTIONS
                  able   information  about  the  optional  features  that  are
                  included, and then exit.
 
-       -d        Behave as if each regex had  the  /D  (debug)  modifier;  the
+       -d        Behave as if each regex has  the  /D  (debug)  modifier;  the
                  internal form is output after compilation.
 
-       -i        Behave  as  if  each  regex  had the /I modifier; information
+       -dfa      Behave  as if each data line contains the \D escape sequence;
+                 this    causes    the    alternative    matching    function,
+                 pcre_dfa_exec(),   to   be   used  instead  of  the  standard
+                 pcre_exec() function (more detail is given below).
+
+       -i        Behave as if each regex  has  the  /I  modifier;  information
                  about the compiled pattern is given after compilation.
 
-       -m        Output the size of each compiled pattern after  it  has  been
-                 compiled.  This  is  equivalent  to adding /M to each regular
-                 expression.  For  compatibility  with  earlier  versions   of
+       -m        Output  the  size  of each compiled pattern after it has been
+                 compiled. This is equivalent to adding  /M  to  each  regular
+                 expression.   For  compatibility  with  earlier  versions  of
                  pcretest, -s is a synonym for -m.
 
-       -o osize  Set  the number of elements in the output vector that is used
-                 when calling pcre_exec() to be osize. The  default  value  is
+       -o osize  Set the number of elements in the output vector that is  used
+                 when  calling  pcre_exec()  to be osize. The default value is
                  45, which is enough for 14 capturing subexpressions. The vec-
-                 tor size can be changed  for  individual  matching  calls  by
+                 tor  size  can  be  changed  for individual matching calls by
                  including \O in the data line (see below).
 
-       -p        Behave  as  if  each regex has /P modifier; the POSIX wrapper
-                 API is used to call PCRE. None of the other options  has  any
-                 effect when -p is set.
+       -p        Behave as if each regex has the /P modifier; the POSIX  wrap-
+                 per  API  is used to call PCRE. None of the other options has
+                 any effect when -p is set.
 
-       -t        Run  each  compile, study, and match many times with a timer,
-                 and output resulting time per compile or match (in  millisec-
-                 onds).  Do  not set -m with -t, because you will then get the
-                 size output a zillion times, and  the  timing  will  be  dis-
+       -t        Run each compile, study, and match many times with  a  timer,
+                 and  output resulting time per compile or match (in millisec-
+                 onds). Do not set -m with -t, because you will then  get  the
+                 size  output  a  zillion  times,  and the timing will be dis-
                  torted.
 
 
 DESCRIPTION
 
-       If  pcretest  is  given two filename arguments, it reads from the first
+       If pcretest is given two filename arguments, it reads  from  the  first
        and writes to the second. If it is given only one filename argument, it
-       reads  from  that  file  and writes to stdout. Otherwise, it reads from
-       stdin and writes to stdout, and prompts for each line of  input,  using
+       reads from that file and writes to stdout.  Otherwise,  it  reads  from
+       stdin  and  writes to stdout, and prompts for each line of input, using
        "re>" to prompt for regular expressions, and "data>" to prompt for data
        lines.
 
        The program handles any number of sets of input on a single input file.
-       Each  set starts with a regular expression, and continues with any num-
+       Each set starts with a regular expression, and continues with any  num-
        ber of data lines to be matched against the pattern.
 
-       Each data line is matched separately and independently. If you want  to
-       do  multiple-line  matches, you have to use the \n escape sequence in a
-       single line of input to encode  the  newline  characters.  The  maximum
+       Each  data line is matched separately and independently. If you want to
+       do multiple-line matches, you have to use the \n escape sequence  in  a
+       single  line  of  input  to  encode the newline characters. The maximum
        length of a data line is 30,000 characters.
 
-       An  empty  line signals the end of the data lines, at which point a new
-       regular expression is read. The regular expressions are given  enclosed
+       An empty line signals the end of the data lines, at which point  a  new
+       regular  expression is read. The regular expressions are given enclosed
        in any non-alphanumeric delimiters other than backslash, for example
 
          /(a|bc)x+yz/
 
-       White  space before the initial delimiter is ignored. A regular expres-
-       sion may be continued over several input lines, in which case the  new-
-       line  characters  are included within it. It is possible to include the
+       White space before the initial delimiter is ignored. A regular  expres-
+       sion  may be continued over several input lines, in which case the new-
+       line characters are included within it. It is possible to  include  the
        delimiter within the pattern by escaping it, for example
 
          /abc\/def/
 
-       If you do so, the escape and the delimiter form part  of  the  pattern,
-       but  since delimiters are always non-alphanumeric, this does not affect
-       its interpretation.  If the terminating delimiter is  immediately  fol-
+       If  you  do  so, the escape and the delimiter form part of the pattern,
+       but since delimiters are always non-alphanumeric, this does not  affect
+       its  interpretation.   If the terminating delimiter is immediately fol-
        lowed by a backslash, for example,
 
          /abc/\
 
-       then  a  backslash  is added to the end of the pattern. This is done to
-       provide a way of testing the error condition that arises if  a  pattern
+       then a backslash is added to the end of the pattern. This  is  done  to
+       provide  a  way of testing the error condition that arises if a pattern
        finishes with a backslash, because
 
          /abc\/
 
-       is  interpreted as the first line of a pattern that starts with "abc/",
+       is interpreted as the first line of a pattern that starts with  "abc/",
        causing pcretest to read the next line as a continuation of the regular
        expression.
 
 
 PATTERN MODIFIERS
 
-       A  pattern may be followed by any number of modifiers, which are mostly
-       single characters. Following Perl usage, these are  referred  to  below
-       as,  for  example,  "the /i modifier", even though the delimiter of the
-       pattern need not always be a slash, and no slash is used  when  writing
-       modifiers.  Whitespace  may  appear between the final pattern delimiter
+       A pattern may be followed by any number of modifiers, which are  mostly
+       single  characters.  Following  Perl usage, these are referred to below
+       as, for example, "the /i modifier", even though the  delimiter  of  the
+       pattern  need  not always be a slash, and no slash is used when writing
+       modifiers. Whitespace may appear between the  final  pattern  delimiter
        and the first modifier, and between the modifiers themselves.
 
        The /i, /m, /s, and /x modifiers set the PCRE_CASELESS, PCRE_MULTILINE,
-       PCRE_DOTALL,  or  PCRE_EXTENDED  options,  respectively, when pcre_com-
-       pile() is called. These four modifier letters have the same  effect  as
+       PCRE_DOTALL, or PCRE_EXTENDED  options,  respectively,  when  pcre_com-
+       pile()  is  called. These four modifier letters have the same effect as
        they do in Perl. For example:
 
          /caseless/i
@@ -128,95 +133,96 @@ PATTERN MODIFIERS
          /A    PCRE_ANCHORED
          /C    PCRE_AUTO_CALLOUT
          /E    PCRE_DOLLAR_ENDONLY
+         /f    PCRE_FIRSTLINE
          /N    PCRE_NO_AUTO_CAPTURE
          /U    PCRE_UNGREEDY
          /X    PCRE_EXTRA
 
-       Searching for all possible matches within each subject  string  can  be
-       requested  by  the  /g  or  /G modifier. After finding a match, PCRE is
+       Searching  for  all  possible matches within each subject string can be
+       requested by the /g or /G modifier. After  finding  a  match,  PCRE  is
        called again to search the remainder of the subject string. The differ-
        ence between /g and /G is that the former uses the startoffset argument
-       to pcre_exec() to start searching at a  new  point  within  the  entire
-       string  (which  is in effect what Perl does), whereas the latter passes
-       over a shortened substring. This makes a  difference  to  the  matching
+       to  pcre_exec()  to  start  searching  at a new point within the entire
+       string (which is in effect what Perl does), whereas the  latter  passes
+       over  a  shortened  substring.  This makes a difference to the matching
        process if the pattern begins with a lookbehind assertion (including \b
        or \B).
 
-       If any call to pcre_exec() in a /g or  /G  sequence  matches  an  empty
-       string,  the next call is done with the PCRE_NOTEMPTY and PCRE_ANCHORED
-       flags set in order to search for another, non-empty, match at the  same
-       point.   If  this  second  match fails, the start offset is advanced by
-       one, and the normal match is retried. This imitates the way  Perl  han-
+       If  any  call  to  pcre_exec()  in a /g or /G sequence matches an empty
+       string, the next call is done with the PCRE_NOTEMPTY and  PCRE_ANCHORED
+       flags  set in order to search for another, non-empty, match at the same
+       point.  If this second match fails, the start  offset  is  advanced  by
+       one,  and  the normal match is retried. This imitates the way Perl han-
        dles such cases when using the /g modifier or the split() function.
 
        There are yet more modifiers for controlling the way pcretest operates.
 
-       The /+ modifier requests that as well as outputting the substring  that
-       matched  the  entire  pattern,  pcretest  should in addition output the
-       remainder of the subject string. This is useful  for  tests  where  the
+       The  /+ modifier requests that as well as outputting the substring that
+       matched the entire pattern, pcretest  should  in  addition  output  the
+       remainder  of  the  subject  string. This is useful for tests where the
        subject contains multiple copies of the same substring.
 
-       The  /L modifier must be followed directly by the name of a locale, for
+       The /L modifier must be followed directly by the name of a locale,  for
        example,
 
          /pattern/Lfr_FR
 
        For this reason, it must be the last modifier. The given locale is set,
-       pcre_maketables()  is called to build a set of character tables for the
-       locale, and this is then passed to pcre_compile()  when  compiling  the
-       regular  expression.  Without  an  /L  modifier,  NULL is passed as the
-       tables pointer; that is, /L applies only to the expression on which  it
+       pcre_maketables() is called to build a set of character tables for  the
+       locale,  and  this  is then passed to pcre_compile() when compiling the
+       regular expression. Without an /L  modifier,  NULL  is  passed  as  the
+       tables  pointer; that is, /L applies only to the expression on which it
        appears.
 
-       The  /I  modifier  requests  that pcretest output information about the
-       compiled pattern (whether it is anchored, has a fixed first  character,
-       and  so  on). It does this by calling pcre_fullinfo() after compiling a
-       pattern. If the pattern is studied, the results of that are  also  out-
+       The /I modifier requests that pcretest  output  information  about  the
+       compiled  pattern (whether it is anchored, has a fixed first character,
+       and so on). It does this by calling pcre_fullinfo() after  compiling  a
+       pattern.  If  the pattern is studied, the results of that are also out-
        put.
 
        The /D modifier is a PCRE debugging feature, which also assumes /I.  It
-       causes the internal form of compiled regular expressions to  be  output
+       causes  the  internal form of compiled regular expressions to be output
        after compilation. If the pattern was studied, the information returned
        is also output.
 
        The /F modifier causes pcretest to flip the byte order of the fields in
-       the  compiled  pattern  that  contain  2-byte  and 4-byte numbers. This
-       facility is for testing the feature in PCRE that allows it  to  execute
+       the compiled pattern that  contain  2-byte  and  4-byte  numbers.  This
+       facility  is  for testing the feature in PCRE that allows it to execute
        patterns that were compiled on a host with a different endianness. This
-       feature is not available when the POSIX  interface  to  PCRE  is  being
-       used,  that is, when the /P pattern modifier is specified. See also the
+       feature  is  not  available  when  the POSIX interface to PCRE is being
+       used, that is, when the /P pattern modifier is specified. See also  the
        section about saving and reloading compiled patterns below.
 
-       The /S modifier causes pcre_study() to be called after  the  expression
+       The  /S  modifier causes pcre_study() to be called after the expression
        has been compiled, and the results used when the expression is matched.
 
-       The /M modifier causes the size of memory block used to hold  the  com-
+       The  /M  modifier causes the size of memory block used to hold the com-
        piled pattern to be output.
 
-       The  /P modifier causes pcretest to call PCRE via the POSIX wrapper API
-       rather than its native API. When this  is  done,  all  other  modifiers
-       except  /i,  /m, and /+ are ignored. REG_ICASE is set if /i is present,
-       and REG_NEWLINE is set if /m is present. The  wrapper  functions  force
-       PCRE_DOLLAR_ENDONLY  always, and PCRE_DOTALL unless REG_NEWLINE is set.
+       The /P modifier causes pcretest to call PCRE via the POSIX wrapper  API
+       rather  than  its  native  API.  When this is done, all other modifiers
+       except /i, /m, and /+ are ignored. REG_ICASE is set if /i  is  present,
+       and  REG_NEWLINE  is  set if /m is present. The wrapper functions force
+       PCRE_DOLLAR_ENDONLY always, and PCRE_DOTALL unless REG_NEWLINE is  set.
 
-       The /8 modifier causes pcretest to call PCRE with the PCRE_UTF8  option
-       set.  This  turns on support for UTF-8 character handling in PCRE, pro-
-       vided that it was compiled with this  support  enabled.  This  modifier
+       The  /8 modifier causes pcretest to call PCRE with the PCRE_UTF8 option
+       set. This turns on support for UTF-8 character handling in  PCRE,  pro-
+       vided  that  it  was  compiled with this support enabled. This modifier
        also causes any non-printing characters in output strings to be printed
        using the \x{hh...} notation if they are valid UTF-8 sequences.
 
-       If the /? modifier  is  used  with  /8,  it  causes  pcretest  to  call
-       pcre_compile()  with  the  PCRE_NO_UTF8_CHECK  option,  to suppress the
+       If  the  /?  modifier  is  used  with  /8,  it  causes pcretest to call
+       pcre_compile() with the  PCRE_NO_UTF8_CHECK  option,  to  suppress  the
        checking of the string for UTF-8 validity.
 
 
 DATA LINES
 
-       Before each data line is passed to pcre_exec(),  leading  and  trailing
-       whitespace  is  removed,  and it is then scanned for \ escapes. Some of
-       these are pretty esoteric features, intended for checking out  some  of
-       the  more  complicated features of PCRE. If you are just testing "ordi-
-       nary" regular expressions, you probably don't need any  of  these.  The
+       Before  each  data  line is passed to pcre_exec(), leading and trailing
+       whitespace is removed, and it is then scanned for \  escapes.  Some  of
+       these  are  pretty esoteric features, intended for checking out some of
+       the more complicated features of PCRE. If you are just  testing  "ordi-
+       nary"  regular  expressions,  you probably don't need any of these. The
        following escapes are recognized:
 
          \a         alarm (= BEL)
@@ -247,6 +253,8 @@ DATA LINES
                       reached for the nth time
          \C*n       pass the number n (may be negative) as callout
                       data; this is used as the callout return value
+         \D         use the pcre_dfa_exec() match function
+         \F         only shortest match for pcre_dfa_exec()
          \Gdd       call pcre_get_substring() for substring dd
                       after a successful match (number less than 32)
          \Gname     call pcre_get_named_substring() for substring
@@ -259,6 +267,8 @@ DATA LINES
          \Odd       set the size of the output vector passed to
                       pcre_exec() to dd (any number of digits)
          \P         pass the PCRE_PARTIAL option to pcre_exec()
+                      or pcre_dfa_exec()
+         \R         pass the PCRE_DFA_RESTART option to pcre_dfa_exec()
          \S         output details of memory get/free calls during matching
          \Z         pass the PCRE_NOTEOL option to pcre_exec()
          \?         pass the PCRE_NO_UTF8_CHECK option to
@@ -266,35 +276,53 @@ DATA LINES
          \>dd       start the match at offset dd (any number of digits);
                       this sets the startoffset argument for pcre_exec()
 
-       A  backslash  followed by anything else just escapes the anything else.
-       If the very last character is a backslash, it is ignored. This gives  a
-       way  of  passing  an empty line as data, since a real empty line termi-
+       A backslash followed by anything else just escapes the  anything  else.
+       If  the very last character is a backslash, it is ignored. This gives a
+       way of passing an empty line as data, since a real  empty  line  termi-
        nates the data input.
 
-       If \M is present, pcretest calls pcre_exec() several times,  with  dif-
-       ferent  values  in  the match_limit field of the pcre_extra data struc-
-       ture, until it finds the minimum number that is needed for  pcre_exec()
-       to  complete.  This  number is a measure of the amount of recursion and
-       backtracking that takes place, and checking it out can be  instructive.
-       For  most  simple  matches, the number is quite small, but for patterns
-       with very large numbers of matching possibilities, it can become  large
+       If  \M  is present, pcretest calls pcre_exec() several times, with dif-
+       ferent values in the match_limit field of the  pcre_extra  data  struc-
+       ture,  until it finds the minimum number that is needed for pcre_exec()
+       to complete. This number is a measure of the amount  of  recursion  and
+       backtracking  that takes place, and checking it out can be instructive.
+       For most simple matches, the number is quite small,  but  for  patterns
+       with  very large numbers of matching possibilities, it can become large
        very quickly with increasing length of subject string.
 
-       When  \O  is  used, the value specified may be higher or lower than the
+       When \O is used, the value specified may be higher or  lower  than  the
        size set by the -o command line option (or defaulted to 45); \O applies
        only to the call of pcre_exec() for the line in which it appears.
 
-       If  the /P modifier was present on the pattern, causing the POSIX wrap-
-       per API to be used, only \B and \Z have any effect, causing  REG_NOTBOL
+       If the /P modifier was present on the pattern, causing the POSIX  wrap-
+       per  API to be used, only \B and \Z have any effect, causing REG_NOTBOL
        and REG_NOTEOL to be passed to regexec() respectively.
 
-       The  use of \x{hh...} to represent UTF-8 characters is not dependent on
-       the use of the /8 modifier on the pattern.  It  is  recognized  always.
-       There  may  be  any number of hexadecimal digits inside the braces. The
-       result is from one to six bytes, encoded according to the UTF-8  rules.
+       The use of \x{hh...} to represent UTF-8 characters is not dependent  on
+       the  use  of  the  /8 modifier on the pattern. It is recognized always.
+       There may be any number of hexadecimal digits inside  the  braces.  The
+       result  is from one to six bytes, encoded according to the UTF-8 rules.
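
The "one to six bytes" range comes from the original UTF-8 encoding rules, which cover values up to 0x7fffffff. The sketch below shows that encoding; the function name utf8_encode is hypothetical and the code is not taken from pcretest.

```python
def utf8_encode(cp):
    """Encode a code point using the original 1 to 6 byte UTF-8 rules."""
    if cp < 0:
        raise ValueError("code point must not be negative")
    if cp < 0x80:
        return bytes([cp])
    # Exclusive upper limits for the 2, 3, 4, 5 and 6 byte forms.
    limits = ((2, 0x800), (3, 0x10000), (4, 0x200000),
              (5, 0x4000000), (6, 0x80000000))
    for nbytes, limit in limits:
        if cp < limit:
            break
    else:
        raise ValueError("value too large for UTF-8")
    # The lead byte starts with nbytes one-bits followed by a zero bit.
    lead_mark = (0xFF << (8 - nbytes)) & 0xFF
    tail = []
    for _ in range(nbytes - 1):
        # Each continuation byte carries six bits, marked with 10xxxxxx.
        tail.append(0x80 | (cp & 0x3F))
        cp >>= 6
    return bytes([lead_mark | cp] + tail[::-1])


print(utf8_encode(0xE9).hex())             # the value of \x{e9}
print(len(utf8_encode(0x7FFFFFFF)))        # the largest encodable value
```

For values up to 0x10ffff (surrogates excepted) the result agrees with what Python's own str.encode("utf-8") produces.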
+
+
+THE ALTERNATIVE MATCHING FUNCTION
+
+       By  default,  pcretest  uses  the  standard  PCRE  matching   function,
+       pcre_exec(), to match each data line. From release 6.0, PCRE supports an
+       alternative matching function, pcre_dfa_exec(),  which  operates  in  a
+       different  way,  and has some restrictions. The differences between the
+       two functions are described in the pcrematching documentation.
+
+       If a data line contains the \D escape sequence, or if the command  line
+       contains  the -dfa option, the alternative matching function is called.
+       This function finds all possible matches at a given point. If, however,
+       the  \F escape sequence is present in the data line, it stops after the
+       first match is found. This is always the shortest possible match.
 
 
-OUTPUT FROM PCRETEST
+DEFAULT OUTPUT FROM PCRETEST
+
+       This section describes the output when the  normal  matching  function,
+       pcre_exec(), is being used.
 
        When a match succeeds, pcretest outputs the list of captured substrings
        that pcre_exec() returns, starting with number 0 for  the  string  that
@@ -350,25 +378,76 @@ OUTPUT FROM PCRETEST
        lines can be included in data by means of the \n escape.
 
 
+OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION
+
+       When the alternative matching function, pcre_dfa_exec(),  is  used  (by
+       means  of  the \D escape sequence or the -dfa command line option), the
+       output consists of a list of all the matches that start  at  the  first
+       point in the subject where there is at least one match. For example:
+
+           re> /(tang|tangerine|tan)/
+         data> yellow tangerine\D
+          0: tangerine
+          1: tang
+          2: tan
+
+       (Using  the  normal  matching function on this data finds only "tang".)
+       The longest matching string is always given first (and numbered  zero).
+
+       If  /g  is  present  on  the  pattern,  the  search for further matches
+       resumes at the end of the longest match. For example:
+
+           re> /(tang|tangerine|tan)/g
+         data> yellow tangerine and tangy sultana\D
+          0: tangerine
+          1: tang
+          2: tan
+          0: tang
+          1: tan
+          0: tan
+
+       Since the matching function does not  support  substring  capture,  the
+       escape  sequences  that  are concerned with captured substrings are not
+       relevant.
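
The shape of this output can be reproduced with a small simulation. The sketch below is not a DFA matcher and does not call PCRE; it assumes the pattern is a plain alternation of literal strings, and the function names are hypothetical. It lists every alternative that matches at the first point where any of them matches, longest first, and for the /g case resumes at the end of the longest match.

```python
import re

ALTERNATIVES = ["tang", "tangerine", "tan"]


def all_matches_at_first_point(subject, start=0):
    """Return (offset, matches) for the first offset where anything matches."""
    for pos in range(start, len(subject)):
        found = [alt for alt in ALTERNATIVES if subject.startswith(alt, pos)]
        if found:
            # Longest match first, as in the numbered output above.
            return pos, sorted(found, key=len, reverse=True)
    return None


def global_search(subject):
    """Imitate /g: resume the search at the end of the longest match."""
    results = []
    pos = 0
    while True:
        hit = all_matches_at_first_point(subject, pos)
        if hit is None:
            return results
        offset, matches = hit
        results.append(matches)
        pos = offset + len(matches[0])


for group in global_search("yellow tangerine and tangy sultana"):
    for number, text in enumerate(group):
        print(f"{number:2d}: {text}")

# For comparison, an ordinary leftmost-first matcher (here Python's re)
# stops at the first alternative that succeeds.
print(re.search(r"(tang|tangerine|tan)", "yellow tangerine").group(0))
```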
+
+
+RESTARTING AFTER A PARTIAL MATCH
+
+       When the alternative matching function has given the PCRE_ERROR_PARTIAL
+       return,  indicating that the subject partially matched the pattern, you
+       can restart the match with additional subject data by means of  the  \R
+       escape sequence. For example:
+
+           re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
+         data> 23ja\P\D
+         Partial match: 23ja
+         data> n05\R\D
+          0: n05
+
+       For  further  information  about  partial matching, see the pcrepartial
+       documentation.
+
+
 CALLOUTS
 
        If the pattern contains any callout requests, pcretest's callout  func-
-       tion  is  called  during  matching. By default, it displays the callout
-       number, the start and current positions in  the  text  at  the  callout
-       time, and the next pattern item to be tested. For example, the output
+       tion  is  called  during  matching. This works with both matching func-
+       tions. By default, the called function displays the callout number, the
+       start  and  current  positions in the text at the callout time, and the
+       next pattern item to be tested. For example, the output
 
          --->pqrabcdef
            0    ^  ^     \d
 
-       indicates  that  callout number 0 occurred for a match attempt starting
-       at the fourth character of the subject string, when the pointer was  at
-       the  seventh  character of the data, and when the next pattern item was
-       \d. Just one circumflex is output if the start  and  current  positions
+       indicates that callout number 0 occurred for a match  attempt  starting
+       at  the fourth character of the subject string, when the pointer was at
+       the seventh character of the data, and when the next pattern  item  was
+       \d.  Just  one  circumflex is output if the start and current positions
        are the same.
 
        Callouts numbered 255 are assumed to be automatic callouts, inserted as
-       a result of the /C pattern modifier. In this case, instead  of  showing
-       the  callout  number, the offset in the pattern, preceded by a plus, is
+       a  result  of the /C pattern modifier. In this case, instead of showing
+       the callout number, the offset in the pattern, preceded by a  plus,  is
        output. For example:
 
            re> /\d?[A-E]\*/C
@@ -380,76 +459,76 @@ CALLOUTS
          +10 ^ ^
           0: E*
 
-       The callout function in pcretest returns zero (carry  on  matching)  by
-       default, but you can use an \C item in a data line (as described above)
+       The  callout  function  in pcretest returns zero (carry on matching) by
+       default, but you can use a \C item in a data line (as described  above)
        to change this.
 
-       Inserting callouts can be helpful when using pcretest to check  compli-
-       cated  regular expressions. For further information about callouts, see
+       Inserting  callouts can be helpful when using pcretest to check compli-
+       cated regular expressions. For further information about callouts,  see
        the pcrecallout documentation.
 
 
 SAVING AND RELOADING COMPILED PATTERNS
 
-       The facilities described in this section are  not  available  when  the
+       The  facilities  described  in  this section are not available when the
        POSIX interface to PCRE is being used, that is, when the /P pattern mod-
        ifier is specified.
 
        When the POSIX interface is not in use, you can cause pcretest to write
-       a  compiled  pattern to a file, by following the modifiers with > and a
+       a compiled pattern to a file, by following the modifiers with >  and  a
        file name.  For example:
 
          /pattern/im >/some/file
 
-       See the pcreprecompile documentation for a discussion about saving  and
+       See  the pcreprecompile documentation for a discussion about saving and
        re-using compiled patterns.
 
-       The  data  that  is  written  is  binary. The first eight bytes are the
-       length of the compiled pattern data  followed  by  the  length  of  the
-       optional  study  data,  each  written as four bytes in big-endian order
-       (most significant byte first). If there is no study  data  (either  the
+       The data that is written is binary.  The  first  eight  bytes  are  the
+       length  of  the  compiled  pattern  data  followed by the length of the
+       optional study data, each written as four  bytes  in  big-endian  order
+       (most  significant  byte  first). If there is no study data (either the
        pattern was not studied, or studying did not return any data), the sec-
-       ond length is zero. The lengths are followed by an exact  copy  of  the
+       ond  length  is  zero. The lengths are followed by an exact copy of the
        compiled pattern. If there is additional study data, this follows imme-
-       diately after the compiled pattern. After writing  the  file,  pcretest
+       diately  after  the  compiled pattern. After writing the file, pcretest
        expects to read a new pattern.
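
The file layout described above (two four-byte big-endian lengths, then the compiled pattern, then any study data) can be read back with a few lines of code. The following Python sketch is for illustration only: the function name split_saved_pattern is hypothetical, and the code splits the file into its parts without interpreting the compiled pattern in any way.

```python
import struct


def split_saved_pattern(blob):
    """Split the bytes of a file saved by pcretest into
    (compiled_pattern, study_data), following the layout described
    in the text. study_data is empty when the second length is zero."""
    if len(blob) < 8:
        raise ValueError("file too short for the two length fields")
    # Two unsigned 32-bit integers, most significant byte first.
    pattern_len, study_len = struct.unpack(">II", blob[:8])
    pattern = blob[8:8 + pattern_len]
    study = blob[8 + pattern_len:8 + pattern_len + study_len]
    if len(pattern) != pattern_len or len(study) != study_len:
        raise ValueError("file is truncated")
    return pattern, study


# A made-up file: 3 bytes of "pattern", no study data.
example = struct.pack(">II", 3, 0) + b"abc"
pattern, study = split_saved_pattern(example)
```

Because the lengths are always stored most significant byte first, the header reads the same way on any host. Byte order inside the compiled pattern itself is a separate matter that this sketch does not address.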
 
        A saved pattern can be reloaded into pcretest by specifying < and a file
-       name instead of a pattern. The name of the file must not  contain  a  <
-       character,  as  otherwise pcretest will interpret the line as a pattern
+       name  instead  of  a pattern. The name of the file must not contain a <
+       character, as otherwise pcretest will interpret the line as  a  pattern
        delimited by < characters.  For example:
 
           re> </some/file
          Compiled regex loaded from /some/file
          No study data
 
-       When the pattern has been loaded, pcretest proceeds to read data  lines
+       When  the pattern has been loaded, pcretest proceeds to read data lines
        in the usual way.
 
-       You  can copy a file written by pcretest to a different host and reload
-       it there, even if the new host has opposite endianness to  the  one  on
-       which  the pattern was compiled. For example, you can compile on an i86
+       You can copy a file written by pcretest to a different host and  reload
+       it  there,  even  if the new host has opposite endianness to the one on
+       which the pattern was compiled. For example, you can compile on an  i86
        machine and run on a SPARC machine.
 
-       File names for saving and reloading can be absolute  or  relative,  but
-       note  that the shell facility of expanding a file name that starts with
+       File  names  for  saving and reloading can be absolute or relative, but
+       note that the shell facility of expanding a file name that starts  with
        a tilde (~) is not available.
 
-       The ability to save and reload files in pcretest is intended for  test-
-       ing  and experimentation. It is not intended for production use because
-       only a single pattern can be written to a file. Furthermore,  there  is
-       no  facility  for  supplying  custom  character  tables  for use with a
-       reloaded pattern. If the original  pattern  was  compiled  with  custom
-       tables,  an  attempt to match a subject string using a reloaded pattern
-       is likely to cause pcretest to crash.  Finally, if you attempt to  load
+       The  ability to save and reload files in pcretest is intended for test-
+       ing and experimentation. It is not intended for production use  because
+       only  a  single pattern can be written to a file. Furthermore, there is
+       no facility for supplying  custom  character  tables  for  use  with  a
+       reloaded  pattern.  If  the  original  pattern was compiled with custom
+       tables, an attempt to match a subject string using a  reloaded  pattern
+       is  likely to cause pcretest to crash.  Finally, if you attempt to load
        a file that is not in the correct format, the result is undefined.
 
 
 AUTHOR
 
-       Philip Hazel <ph10@cam.ac.uk>
+       Philip Hazel
        University Computing Service,
        Cambridge CB2 3QG, England.
 
-Last updated: 10 September 2004
-Copyright (c) 1997-2004 University of Cambridge.
+Last updated: 28 February 2005
+Copyright (c) 1997-2005 University of Cambridge.