compile, step, or advance Subroutine

Purpose

Compiles and matches regular-expression patterns.

Note: Commands use the regcomp, regexec, regfree, and regerror subroutines for the functions described in this article.

Library

Standard C Library (libc.a)

Syntax

#define INIT declarations
#define GETC( ) getc_code
#define PEEKC( ) peekc_code
#define UNGETC(c) ungetc_code
#define RETURN(pointer) return_code
#define ERROR(val) error_code

#include <regexp.h>
#include <NLregexp.h>

char *compile (InString, ExpBuffer, EndBuffer, EndOfFile)
char *ExpBuffer;
char *InString, *EndBuffer;
int EndOfFile;

int step (String, ExpBuffer)
const char *String, *ExpBuffer;

int advance (String, ExpBuffer)
const char *String, *ExpBuffer;

Description

The /usr/include/regexp.h file contains subroutines that perform regular-expression pattern matching. Programs that perform regular-expression pattern matching use this source file. Thus, only the regexp.h file needs to be changed to maintain regular expression compatibility between programs.

The interface to this file is complex. Programs that include this file must define the following six macros before the #include <regexp.h> statement. These macros are used by the compile subroutine:

Item Description
INIT This macro is used for dependent declarations and initializations. It is placed right after the declaration and opening { (left brace) of the compile subroutine. The definition of the INIT macro must end with a ; (semicolon). INIT is frequently used to set a register variable to point to the beginning of the regular expression so that this register variable can be used in the declarations for the GETC, PEEKC, and UNGETC macros. Alternatively, you can use INIT to declare external variables that GETC, PEEKC, and UNGETC require.
GETC( ) This macro returns the value of the next character in the regular expression pattern. Successive calls to the GETC macro should return successive characters of the pattern.
PEEKC( ) This macro returns the next character in the regular expression. Successive calls to the PEEKC macro should return the same character, which should also be the next character returned by the GETC macro.
UNGETC(c) This macro causes the parameter c to be returned by the next call to the GETC and PEEKC macros. No more than one character of pushback is ever needed, and this character is guaranteed to be the last character read by the GETC macro. The return value of the UNGETC macro is always ignored.
RETURN(pointer) This macro is used for normal exit of the compile subroutine. The pointer parameter points to the first character immediately following the compiled regular expression. This is useful for programs that have memory allocation to manage.
ERROR(val) This macro is used for abnormal exit from the compile subroutine. It should never contain a return statement. The val parameter is an error number. The error values and their meanings are:
Error Meaning
11 Interval end point too large
16 Bad number
25 \digit out of range
36 Illegal or missing delimiter
41 No remembered search string
42 \( \) imbalance
43 Too many \(
44 More than two numbers given in \{ \}
45 } expected after \
46 First number exceeds second in \{ \}
49 [ ] imbalance
50 Regular expression overflow
70 Invalid endpoint in range
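
A program's ERROR macro usually passes val to a handler that reports the problem. The following is a minimal sketch of such a lookup, based only on the error values listed above. The regexp_error_text name and the table layout are hypothetical and are not part of the regexp.h file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper, not part of regexp.h: copies the meaning of a
   compile error value into buf and returns buf. */
static char *regexp_error_text(int val, char *buf, size_t size)
{
    static const struct { int code; const char *meaning; } table[] = {
        { 11, "Interval end point too large" },
        { 16, "Bad number" },
        { 25, "\\digit out of range" },
        { 36, "Illegal or missing delimiter" },
        { 41, "No remembered search string" },
        { 42, "\\( \\) imbalance" },
        { 43, "Too many \\(" },
        { 44, "More than two numbers given in \\{ \\}" },
        { 45, "} expected after \\" },
        { 46, "First number exceeds second in \\{ \\}" },
        { 49, "[ ] imbalance" },
        { 50, "Regular expression overflow" },
        { 70, "Invalid endpoint in range" }
    };
    size_t i;

    assert(buf != NULL && size > 0);

    for (i = 0; i < sizeof table / sizeof table[0]; i++) {
        if (table[i].code == val) {
            snprintf(buf, size, "%s", table[i].meaning);
            return buf;
        }
    }
    /* Values outside the documented list get a generic message. */
    snprintf(buf, size, "Unknown error %d", val);
    return buf;
}
```

A handler such as the regerr routine used in the Examples section could call a helper like this before abandoning the compilation.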

The compile subroutine compiles the regular expression for later use. The InString parameter is never used explicitly by the compile subroutine, but you can use it in your macros. For example, you can pass the string containing the pattern as the InString parameter to compile and use the INIT macro to set a pointer to the beginning of this string. The example in the Examples section uses this technique. If your macros do not use InString, then call compile with a value of ((char *) 0) for this parameter.
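
The behavior that the compile subroutine expects from GETC, PEEKC, and UNGETC can be checked in isolation. The following sketch does not include the regexp.h file. It defines the three macros over a local pointer named sp, which plays the role that INIT gives it in a real program, and the macros_follow_contract function is a hypothetical name used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Pointer-based definitions in the style of the Examples section. */
#define GETC()    (*sp++)
#define PEEKC()   (*sp)
#define UNGETC(c) (--sp)

/* Hypothetical check: returns 1 if the macros behave as described
   for the first character of a non-empty pattern. */
static int macros_follow_contract(const char *instring)
{
    const char *sp = instring;   /* the job INIT does in a real program */
    int peeked, got;

    assert(instring != NULL && instring[0] != '\0');

    peeked = PEEKC();
    if (PEEKC() != peeked)       /* repeated peeks return the same character */
        return 0;

    got = GETC();
    if (got != peeked)           /* GETC returns the character PEEKC showed */
        return 0;

    UNGETC(got);                 /* push back the last character read */
    if (PEEKC() != got || GETC() != got)
        return 0;

    return sp == instring + 1;   /* exactly one character consumed overall */
}
```

Because UNGETC only steps the pointer back, it relies on the guarantee that the pushed-back character is the last one read by GETC.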

The ExpBuffer parameter points to a character array where the compiled regular expression is to be placed. The EndBuffer parameter points to the location that immediately follows the character array where the compiled regular expression is to be placed. If the compiled expression cannot fit in (EndBuffer-ExpBuffer) bytes, the call ERROR(50) is made.

The EndOfFile parameter is the character that marks the end of the regular expression. For example, in the ed command, this character is usually / (slash).

The regexp.h file defines other subroutines that perform actual regular-expression pattern matching. One of these is the step subroutine.

The String parameter of the step subroutine is a pointer to a null-terminated string of characters to be checked for a match.

The ExpBuffer parameter points to the compiled regular expression, obtained by a call to the compile subroutine.

The step subroutine returns the value 1 if the given string matches the pattern, and 0 if it does not match. If it matches, then step also sets two global character pointers: loc1, which points to the first character that matches the pattern, and loc2, which points to the character immediately following the last character that matches the pattern. Thus, if the regular expression matches the entire string, loc1 points to the first character of the String parameter and loc2 points to the null character at the end of the String parameter.

The step subroutine uses the global variable circf, which is set by the compile subroutine if the regular expression begins with a ^ (circumflex). If this variable is set, step only tries to match the regular expression to the beginning of the string. If you compile more than one regular expression before executing the first one, save the value of circf for each compiled expression and set circf to that saved value before each call to step.

Using the same parameters that were passed to it, the step subroutine calls a subroutine named advance. The step subroutine increments through the String parameter and calls the advance subroutine until it returns a 1, indicating a match, or until the end of String is reached. To constrain the match to the beginning of the string in all cases, call the advance subroutine directly instead of calling the step subroutine.
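
The relationship between step and advance can be pictured with a simplified model. The sketch below is not the regexp.h implementation: the toy_ names are hypothetical, the "compiled expression" is just a literal string, and toy_circf, toy_loc1, and toy_loc2 stand in for the circf, loc1, and loc2 globals.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static const char *toy_loc1;   /* stand-in for loc1 */
static const char *toy_loc2;   /* stand-in for loc2 */
static int toy_circf;          /* stand-in for circf */

/* Anchored match: succeeds only if lit occurs exactly at s. */
static int toy_advance_literal(const char *s, const char *lit)
{
    size_t n = strlen(lit);

    if (strncmp(s, lit, n) != 0)
        return 0;
    toy_loc2 = s + n;          /* character following the match */
    return 1;
}

/* Unanchored search: tries each starting position in turn. */
static int toy_step(const char *s, const char *lit)
{
    assert(s != NULL && lit != NULL);

    if (toy_circf) {           /* pattern began with ^: one attempt only */
        if (!toy_advance_literal(s, lit))
            return 0;
        toy_loc1 = s;
        return 1;
    }
    do {
        if (toy_advance_literal(s, lit)) {
            toy_loc1 = s;      /* first character of the match */
            return 1;
        }
    } while (*s++ != '\0');
    return 0;
}
```

With toy_circf set to 0, searching "hello world" for "world" succeeds and leaves toy_loc1 at offset 6 and toy_loc2 at the terminating null character. With toy_circf set to 1, the same search fails because only the start of the string is tried.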

When the advance subroutine encounters an * (asterisk) or a \{ \} sequence in the regular expression, it advances its pointer to the string to be matched as far as possible and recursively calls itself, trying to match the rest of the string to the rest of the regular expression. As long as there is no match, the advance subroutine backs up along the string until it finds a match or reaches the point in the string that initially matched the * or \{ \}. You can stop this backing-up before the initial point in the string is reached. If the locs global character pointer is equal to the current point in the string at any time during the backing-up process, the advance subroutine breaks out of the loop that backs up and returns 0. This is used for global substitutions on the whole line so that expressions such as s/y*//g do not loop forever.
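
The backing-up behavior can be modeled with a small matcher. The sketch below is an illustration under stated assumptions, not the regexp.h code: it works on an uncompiled pattern that contains only literal characters and single characters followed by *, and the toy_advance_star, toy_locs, and toy_loc2 names are hypothetical stand-ins for advance, locs, and loc2.

```c
#include <assert.h>
#include <stddef.h>

static const char *toy_locs;   /* stand-in for locs; NULL disables the check */
static const char *toy_loc2;   /* stand-in for loc2 */

/* Anchored match of pattern ep at position lp. */
static int toy_advance_star(const char *lp, const char *ep)
{
    assert(lp != NULL && ep != NULL);

    for (;;) {
        if (*ep == '\0') {             /* whole pattern matched */
            toy_loc2 = lp;
            return 1;
        }
        if (ep[1] == '*') {
            const char *curlp = lp;    /* where the starred item began */

            while (*lp == *ep)         /* advance as far as possible */
                lp++;
            ep += 2;
            for (;;) {                 /* back up one character at a time */
                if (lp == toy_locs)
                    return 0;          /* stop early at the locs position */
                if (toy_advance_star(lp, ep))
                    return 1;
                if (lp == curlp)
                    return 0;          /* reached the initial point */
                lp--;
            }
        }
        if (*lp != *ep)                /* literal mismatch or end of string */
            return 0;
        lp++;
        ep++;
    }
}
```

In this model, matching y* at a position that holds no y succeeds with an empty match. Setting toy_locs to that position makes the same call return 0, which is how a global substitution can avoid matching the empty string repeatedly at the place where the previous match ended.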

Note: In 64-bit mode, these interfaces are not supported: they fail with a return code of 0. In order to use the 64-bit version of this functionality, applications should migrate to the fnmatch, glob, regcomp, and regexec functions which provide full internationalized regular expression functionality compatible with ISO 9945-1:1996 (IEEE POSIX 1003.1) and with the UNIX98 specification.
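
For applications that migrate, the following sketch shows one way to obtain the equivalent of a step call with the POSIX regcomp, regexec, and regfree subroutines. The find_match helper and its parameter names are hypothetical. The start and end pointers it fills in correspond to what loc1 and loc2 provide after a successful step call.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 on a match, 0 on no match, and -1 if
   the pattern does not compile. On a match, *start and *end delimit
   the matched text within string. */
static int find_match(const char *pattern, const char *string,
                      const char **start, const char **end)
{
    regex_t re;
    regmatch_t m[1];
    int rc;

    assert(pattern != NULL && string != NULL);
    assert(start != NULL && end != NULL);

    /* A cflags value of 0 selects basic regular expressions. */
    if (regcomp(&re, pattern, 0) != 0)
        return -1;

    rc = regexec(&re, string, 1, m, 0);
    if (rc == 0) {
        *start = string + m[0].rm_so;   /* first matched character */
        *end = string + m[0].rm_eo;     /* character after the match */
    }
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

Unlike the compile subroutine, regcomp needs no user-defined macros and manages the storage for the compiled expression itself, so regfree must be called when the expression is no longer needed.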

Parameters

Item Description
InString Specifies the string containing the pattern to be compiled. The InString parameter is not used explicitly by the compile subroutine, but it may be used in macros.
ExpBuffer Points to a character array where the compiled regular expression is to be placed.
EndBuffer Points to the location that immediately follows the character array where the compiled regular expression is to be placed.
EndOfFile Specifies the character that marks the end of the regular expression.
String Points to a null-terminated string of characters to be checked for a match.

Examples

The following is an example of the regular expression macros and calls:

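/* Macros required by regexp.h. sp walks the pattern passed as instring. */
#define INIT         register char *sp = instring;
#define GETC()       (*sp++)
#define PEEKC()      (*sp)
#define UNGETC(c)    (--sp)
/* compile is declared to return char *, so return the end pointer. */
#define RETURN(c)    return (c);
#define ERROR(c)     regerr()

#include <regexp.h>
 . . .
/* Compile patstr into expbuf. '\0' marks the end of the pattern. */
compile (patstr, expbuf, &expbuf[ESIZE], '\0');
 . . .
/* Search linebuf for a match of the compiled expression. */
if (step (linebuf, expbuf))
   succeed( );
 . . .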