TRE API reference manual

The regcomp() functions

#include <tre/regex.h>

int regcomp(regex_t *preg, const char *regex, int cflags);
int regncomp(regex_t *preg, const char *regex, size_t len, int cflags);
int regwcomp(regex_t *preg, const wchar_t *regex, int cflags);
int regwncomp(regex_t *preg, const wchar_t *regex, size_t len, int cflags);

The regcomp() function compiles the regex string pointed to by regex to an internal representation and stores the result in the pattern buffer structure pointed to by preg. The regncomp() function is like regcomp(), but regex is not terminated with the null byte. Instead, the len argument is used to give the length of the string, and the string may contain null bytes. The regwcomp() and regwncomp() functions work like regcomp() and regncomp(), respectively, but take a wide character (wchar_t) string instead of a byte string.

The cflags argument is the bitwise inclusive OR of zero or more of the following flags (defined in the header <tre/regex.h>):

REG_EXTENDED
Use POSIX Extended Regular Expression (ERE) compatible syntax when compiling regex. The default syntax is the POSIX Basic Regular Expression (BRE) syntax, but it is considered obsolete.
REG_ICASE
Ignore case. Subsequent searches with the regexec family of functions using this pattern buffer will be case insensitive.
REG_NOSUB
Do not report submatches. Subsequent searches with the regexec family of functions will only report whether a match was found or not and will not fill the submatch array.
REG_NEWLINE
Normally the newline character is treated as an ordinary character. When this flag is used, the newline character ('\n', ASCII code 10) is treated specially as follows:
  1. The match-any-character operator (dot "." outside a bracket expression) does not match a newline.
  2. A non-matching list ([^...]) not containing a newline does not match a newline.
  3. The match-beginning-of-line operator ^ matches the empty string immediately after a newline as well as the empty string at the beginning of the string (but see the REG_NOTBOL regexec() flag below).
  4. The match-end-of-line operator $ matches the empty string immediately before a newline as well as the empty string at the end of the string (but see the REG_NOTEOL regexec() flag below).
REG_LITERAL
Interpret the entire regex argument as a literal string, that is, all characters will be considered ordinary. This is a nonstandard extension, compatible with but not specified by POSIX.
REG_NOSPEC
Same as REG_LITERAL. This flag is provided for compatibility with BSD.
REG_RIGHT_ASSOC
By default, concatenation is left associative in TRE, as per the grammar given in the base specifications on regular expressions of Std 1003.1-2001 (POSIX). This flag flips associativity of concatenation to right associative. Associativity can have an effect on how a match is divided into submatches, but does not change what is matched by the entire regexp.

The regex_t structure has the following fields that the application can read:

size_t re_nsub
Number of parenthesized subexpressions in regex.

The regcomp() functions return zero if the compilation was successful, or one of the following error codes if there was an error:

REG_BADPAT
Invalid regexp. TRE returns this only if a multibyte character set is used in the current locale, and regex contained an invalid multibyte sequence.
REG_ECOLLATE
Invalid collating element referenced. TRE returns this whenever equivalence classes or multicharacter collating elements are used in bracket expressions (they are not supported yet).
REG_ECTYPE
Unknown character class name in [[:name:]].
REG_EESCAPE
The last character of regex was a backslash (\).
REG_ESUBREG
Invalid back reference; number in \digit invalid.
REG_EBRACK
[] imbalance.
REG_EPAREN
\(\) or () imbalance.
REG_EBRACE
\{\} or {} imbalance.
REG_BADBR
{} content invalid: not a number, more than two numbers, first larger than second, or number too large.
REG_ERANGE
Invalid character range, e.g. ending point is earlier in the collating order than the starting point.
REG_ESPACE
Out of memory.
REG_BADRPT
Invalid use of repetition operator. TRE never returns this.
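
As a minimal sketch of the compile step described above, the following helper compiles a pattern, prints a readable message on failure, and reports re_nsub on success. The helper name compile_status() is hypothetical. The fallback to the system <regex.h> is an assumption made only so that the sketch also builds where TRE is not installed. regerror() and regfree() are the standard POSIX companions of regcomp().

```c
#include <stddef.h>
#include <stdio.h>

/* Prefer TRE's header; the fallback to the system <regex.h> is an
   assumption made only so the sketch builds without TRE. */
#if defined(__has_include)
#  if __has_include(<tre/regex.h>)
#    include <tre/regex.h>
#    define SKETCH_HAVE_TRE 1
#  endif
#endif
#ifndef SKETCH_HAVE_TRE
#  include <regex.h>
#endif

/* Hypothetical helper: compile `pattern` with `cflags`, store the number of
   parenthesized subexpressions in *nsub, and return regcomp()'s status. */
int compile_status(const char *pattern, int cflags, size_t *nsub)
{
    regex_t re;
    int rc = regcomp(&re, pattern, cflags);

    if (rc != 0) {
        char msg[128];
        /* Translate the error code (REG_EPAREN, REG_EBRACK, ...) to text. */
        regerror(rc, &re, msg, sizeof msg);
        fprintf(stderr, "regcomp(\"%s\"): %s\n", pattern, msg);
        return rc;
    }
    *nsub = re.re_nsub;
    regfree(&re);   /* release the pattern buffer once it is no longer needed */
    return 0;
}
```

A call such as compile_status("(a)(b)*", REG_EXTENDED, &n) is expected to succeed with n set to 2, while an unbalanced pattern such as "a(b" yields a nonzero error code.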

The regexec() functions

#include <tre/regex.h>

int regexec(const regex_t *preg, const char *string, size_t nmatch,
            regmatch_t pmatch[], int eflags);
int regnexec(const regex_t *preg, const char *string, size_t len,
             size_t nmatch, regmatch_t pmatch[], int eflags);
int regwexec(const regex_t *preg, const wchar_t *string, size_t nmatch,
             regmatch_t pmatch[], int eflags);
int regwnexec(const regex_t *preg, const wchar_t *string, size_t len,
              size_t nmatch, regmatch_t pmatch[], int eflags);

The regexec() function matches the null-terminated string against the compiled regexp preg, initialized by a previous call to any one of the regcomp functions. The regnexec() function is like regexec(), but string is not terminated with a null byte. Instead, the len argument is used to give the length of the string, and the string may contain null bytes. The regwexec() and regwnexec() functions work like regexec() and regnexec(), respectively, but take a wide character (wchar_t) string instead of a byte string. The eflags argument is a bitwise OR of zero or more of the following flags:

REG_NOTBOL
When this flag is used, the match-beginning-of-line operator ^ does not match the empty string at the beginning of string. If REG_NEWLINE was used when compiling preg the empty string immediately after a newline character will still be matched.
REG_NOTEOL
When this flag is used, the match-end-of-line operator $ does not match the empty string at the end of string. If REG_NEWLINE was used when compiling preg the empty string immediately before a newline character will still be matched.

These flags are useful when different portions of a string are passed to regexec and the beginning or end of the partial string should not be interpreted as the beginning or end of a line.
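
One way this can look in practice is a loop that finds successive matches by calling regexec() on the remainder of the string. The sketch below passes REG_NOTBOL on every call after the first, so that ^ cannot match in the middle of the original string. The helper name count_matches() is hypothetical, and the header fallback is an assumption made only so that the sketch builds without TRE.

```c
#include <stddef.h>
#include <string.h>

/* Prefer TRE's header; the fallback to the system <regex.h> is an
   assumption made only so the sketch builds without TRE. */
#if defined(__has_include)
#  if __has_include(<tre/regex.h>)
#    include <tre/regex.h>
#    define SKETCH_HAVE_TRE 1
#  endif
#endif
#ifndef SKETCH_HAVE_TRE
#  include <regex.h>
#endif

/* Hypothetical helper: count non-overlapping matches of `pattern` in `text`.
   Returns -1 if the pattern does not compile. */
int count_matches(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m;
    int count = 0;
    size_t pos = 0, len = strlen(text);

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    while (pos <= len) {
        /* After the first call, text + pos is not the beginning of a line. */
        int eflags = (pos > 0) ? REG_NOTBOL : 0;
        if (regexec(&re, text + pos, 1, &m, eflags) != 0)
            break;
        count++;
        /* Offsets are relative to text + pos. Step one extra unit past an
           empty match so the loop always makes progress. */
        pos += (m.rm_eo > m.rm_so) ? (size_t)m.rm_eo : (size_t)m.rm_eo + 1;
    }
    regfree(&re);
    return count;
}
```

With this loop, the pattern "^ab" is counted once in "ababab", whereas "ab" is counted three times. Without REG_NOTBOL the anchored pattern would match at the start of every remainder.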

If REG_NOSUB was used when compiling preg, nmatch is zero, or pmatch is NULL, then the pmatch argument is ignored. Otherwise, the submatches corresponding to the parenthesized subexpressions are filled in the elements of pmatch, which must be dimensioned to have at least nmatch elements.

The regmatch_t structure contains at least the following fields:

regoff_t rm_so
Byte offset from start of string to start of substring.
regoff_t rm_eo
Byte offset from start of string to the first character after the substring.

The length of a submatch in bytes can be computed by subtracting rm_so from rm_eo. If a parenthesized subexpression did not participate in a match, the rm_so and rm_eo fields for the corresponding pmatch element are set to -1. When a multibyte character set is in effect, the submatch offsets are given as byte offsets, not character offsets.

The regexec() functions return zero if a match was found, otherwise they return REG_NOMATCH to indicate no match, or REG_ESPACE to indicate that enough temporary memory could not be allocated to complete the matching operation.
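
The following sketch shows one way to read the pmatch array: it copies a chosen submatch into a caller-supplied buffer, using rm_eo - rm_so as the length and treating rm_so == -1 as "did not participate". The helper name extract_submatch() and the MAX_GROUPS limit are hypothetical. The header fallback is an assumption made only so that the sketch builds without TRE.

```c
#include <stddef.h>
#include <string.h>

/* Prefer TRE's header; the fallback to the system <regex.h> is an
   assumption made only so the sketch builds without TRE. */
#if defined(__has_include)
#  if __has_include(<tre/regex.h>)
#    include <tre/regex.h>
#    define SKETCH_HAVE_TRE 1
#  endif
#endif
#ifndef SKETCH_HAVE_TRE
#  include <regex.h>
#endif

/* Hypothetical helper: copy submatch `idx` (0 = whole match) of the first
   match of `pattern` in `text` into `out`. Returns the submatch length in
   bytes, or -1 if there is no match, the subexpression did not participate,
   or the buffer is too small. */
int extract_submatch(const char *pattern, const char *text, size_t idx,
                     char *out, size_t outsz)
{
    enum { MAX_GROUPS = 10 };          /* arbitrary limit for this sketch */
    regex_t re;
    regmatch_t pm[MAX_GROUPS];
    int len = -1;

    if (idx >= MAX_GROUPS || outsz == 0)
        return -1;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, text, MAX_GROUPS, pm, 0) == 0 && pm[idx].rm_so != -1) {
        /* Length in bytes is the end offset minus the start offset. */
        size_t n = (size_t)(pm[idx].rm_eo - pm[idx].rm_so);
        if (n < outsz) {
            memcpy(out, text + pm[idx].rm_so, n);
            out[n] = '\0';
            len = (int)n;
        }
    }
    regfree(&re);
    return len;
}
```

For the pattern "([a-z]+)@([a-z]+)" and the string "mail bob@example now", submatch 1 covers byte offsets 5 to 8 ("bob") and submatch 2 covers offsets 9 to 16 ("example").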

reguexec()

#include <tre/regex.h>

typedef struct {
  int (*get_next_char)(tre_char_t *c, unsigned int *pos_add, void *context);
  void (*rewind)(size_t pos, void *context);
  int (*compare)(size_t pos1, size_t pos2, size_t len, void *context);
  void *context;
} tre_str_source;

int reguexec(const regex_t *preg, const tre_str_source *string, size_t nmatch,
             regmatch_t pmatch[], int eflags);

The reguexec() function works just like the other regexec() functions, except that the input string is read from user specified callback functions instead of a character array. This makes it possible, for example, to match regexps over arbitrary user specified data structures.

The tre_str_source structure contains the following fields:

get_next_char
This function must retrieve the next available character. If a character is not available, this must return a nonzero value. If a character is available, it must be stored to the space pointed to by c, the integer pointed to by pos_add must be set to the number of units advanced in the input (the value must be >=1), and zero must be returned.
rewind
This function must rewind the input stream to the position specified by pos. Unless the regexp uses back references, rewind is not needed and can be set to NULL.
compare
This function compares two substrings in the input streams starting at the positions specified by pos1 and pos2 of length len. If the substrings are equal, compare must return zero, otherwise a nonzero value must be returned. Unless the regexp uses back references, compare is not needed and can be set to NULL.
context
This is a context variable, passed as the last argument to all of the above functions for keeping track of the internal state of the user's code.

The position in the input stream is measured in size_t units. The current position is the sum of the increments gotten from pos_add (plus the position of the last rewind, if any). The starting position is zero. Submatch positions filled in the pmatch[] array are, of course, given using positions computed in this way.

For an example of how to use reguexec(), see the tests/test-str-source.c file in the TRE source code distribution.

The approximate matching functions

#include <tre/regex.h>

typedef struct {
  int cost_ins;
  int cost_del;
  int cost_subst;
  int max_cost;

  int max_ins;
  int max_del;
  int max_subst;
  int max_err;
} regaparams_t;

typedef struct {
  size_t nmatch;
  regmatch_t *pmatch;
  int cost;
  int num_ins;
  int num_del;
  int num_subst;
} regamatch_t;

int regaexec(const regex_t *preg, const char *string,
             regamatch_t *match, regaparams_t params, int eflags);
int reganexec(const regex_t *preg, const char *string, size_t len,
              regamatch_t *match, regaparams_t params, int eflags);
int regawexec(const regex_t *preg, const wchar_t *string,
              regamatch_t *match, regaparams_t params, int eflags);
int regawnexec(const regex_t *preg, const wchar_t *string, size_t len,
               regamatch_t *match, regaparams_t params, int eflags);

The regaexec() function searches for the best match in string against the compiled regexp preg, initialized by a previous call to any one of the regcomp functions.

The reganexec() function is like regaexec(), but string is not terminated by a null byte. Instead, the len argument is used to tell the length of the string, and the string may contain null bytes. The regawexec() and regawnexec() functions work like regaexec() and reganexec(), respectively, but take a wide character (wchar_t) string instead of a byte string.

The eflags argument has the same meaning as for the regexec() functions.

The params struct controls the approximate matching parameters:

int cost_ins
The default cost of an inserted character, that is, an extra character in string.
int cost_del
The default cost of a deleted character, that is, a character missing from string.
int cost_subst
The default cost of a substituted character.
int max_cost
The maximum allowed cost of a match. If this is set to zero, an exact match is searched for, and the results are equivalent to those returned by the regexec() functions.
int max_ins
Maximum allowed number of inserted characters.
int max_del
Maximum allowed number of deleted characters.
int max_subst
Maximum allowed number of substituted characters.
int max_err
Maximum allowed number of errors (inserts + deletes + substitutes).

The match argument points to a regamatch_t structure. The nmatch and pmatch fields must be filled by the caller. If REG_NOSUB was used when compiling the regexp, or match->nmatch is zero, or match->pmatch is NULL, the match->pmatch argument is ignored. Otherwise, the submatches corresponding to the parenthesized subexpressions are filled in the elements of match->pmatch, which must be dimensioned to have at least match->nmatch elements. The match->cost field is set to the cost of the match found, and the match->num_ins, match->num_del, and match->num_subst fields are set to the number of inserts, deletes, and substitutes in the match, respectively.

The regaexec() functions return zero if a match with cost no greater than params.max_cost was found (params is passed by value), otherwise they return REG_NOMATCH to indicate no match, or REG_ESPACE to indicate that enough temporary memory could not be allocated to complete the matching operation.
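
A minimal sketch of an approximate search, assuming unit costs for all three edit operations and no need for submatch positions. The helper name approx_cost() and its -1/-2 return conventions are hypothetical. The code is guarded with TRE_APPROX, so it compiles to a stub when the TRE header is missing or approximate matching is not enabled.

```c
#include <stddef.h>

/* Include TRE's header when it is installed. TRE_APPROX is defined by it
   only if approximate matching support is enabled. */
#if defined(__has_include)
#  if __has_include(<tre/regex.h>)
#    include <tre/regex.h>
#  endif
#endif

/* Hypothetical helper: return the cost of the best approximate match of
   `pattern` in `text` within `max_cost`, -1 on no match or error, and -2
   when approximate matching is not available in this build. */
int approx_cost(const char *pattern, const char *text, int max_cost)
{
#ifdef TRE_APPROX
    regex_t re;
    regaparams_t params;
    regamatch_t match;
    int result = -1;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    /* Every insertion, deletion, and substitution costs one unit. */
    params.cost_ins = params.cost_del = params.cost_subst = 1;
    params.max_cost = max_cost;
    /* With unit costs the per-type limits add nothing beyond max_cost. */
    params.max_ins = params.max_del = params.max_subst = max_cost;
    params.max_err = max_cost;

    /* nmatch and pmatch must be set by the caller; no submatches wanted. */
    match.nmatch = 0;
    match.pmatch = NULL;

    if (regaexec(&re, text, &match, params, 0) == 0)
        result = match.cost;
    regfree(&re);
    return result;
#else
    (void)pattern; (void)text; (void)max_cost;
    return -2;
#endif
}
```

Setting every field of params explicitly avoids passing uninitialized limits to regaexec(). A caller would typically try a small budget first, for example approx_cost("hello", "say helo there", 1), and widen it only if no match is found.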

Miscellaneous

#include <tre/regex.h>

int tre_have_backrefs(const regex_t *preg);
int tre_have_approx(const regex_t *preg);

The tre_have_backrefs() and tre_have_approx() functions return 1 if the compiled pattern has back references or uses approximate matching, respectively, and 0 if not.

Checking build time options

#include <tre/regex.h>

char *tre_version(void);
int tre_config(int query, void *result);

The tre_config() function can be used to retrieve information about which optional features have been compiled into the TRE library, and about other parameters that may change between releases.

The query argument is an integer that selects which piece of information is requested. The result argument is a pointer to a variable where the information is returned. The return value of a call to tre_config() is zero if query was recognized, REG_NOMATCH otherwise.

The following values are recognized for query:

TRE_CONFIG_APPROX
The result is an integer that is set to one if approximate matching support is available, zero if not.
TRE_CONFIG_WCHAR
The result is an integer that is set to one if wide character support is available, zero if not.
TRE_CONFIG_MULTIBYTE
The result is an integer that is set to one if multibyte character set support is available, zero if not.
TRE_CONFIG_SYSTEM_ABI
The result is an integer that is set to one if TRE has been compiled to be compatible with the system regex ABI, zero if not.
TRE_CONFIG_VERSION
The result is a pointer to a static character string that gives the version of the TRE library.

The tre_version() function returns a short human readable character string which shows the software name, version, and license.

Preprocessor definitions

The header <tre/regex.h> defines certain C preprocessor symbols.

Version information

The following definitions may be useful for checking whether a new enough version is being used. Note that it is recommended to use the pkg-config tool for version and other checks in Autoconf scripts.

TRE_VERSION
The version string.
TRE_VERSION_1
The major version number (first part of version string).
TRE_VERSION_2
The minor version number (second part of version string).
TRE_VERSION_3
The micro version number (third part of version string).

Features

The following definitions may be useful for checking whether all necessary features are enabled. Use these only if compile time checking suffices (linking statically with TRE). When linking dynamically, tre_config() should be used instead.

TRE_APPROX
This is defined if approximate matching support is enabled. The prototypes for approximate matching functions are defined only if TRE_APPROX is defined.
TRE_WCHAR
This is defined if wide character support is enabled. The prototypes for wide character matching functions are defined only if TRE_WCHAR is defined.
TRE_MULTIBYTE
This is defined if multibyte character set support is enabled. If this is not defined, any locale settings are ignored, and the default locale is used when parsing regexps and matching strings.