lex(1)

NAME

lex − lexical analysis program generator

SYNOPSIS

lex [ −cntv ] [ −e | −w ] [ −V −Q [ y | n ] ] [ filename ] ...

DESCRIPTION

lex generates programs to be used in simple lexical analysis of text. Each filename (the standard input by default) contains regular expressions to search for, and actions written in C to be executed when expressions are found.

A C source program, lex.yy.c, is generated. Compile it as follows:

cc lex.yy.c -ll

This program, when run, copies unrecognized portions of the input to the output, and executes the associated C action for each regular expression that is recognized. The actual string matched is left in yytext, an external character array (a wchar_t array when the −w option is given).

Matching is done in order of the strings in the file. The strings may contain square brackets to indicate character classes, as in [abx-z] to indicate a, b, x, y, and z; and the operators *, + and ?, which mean, respectively, any nonnegative number, any positive number, or either zero or one occurrences of the previous character or character class. The “dot” character (‘.’) is the class of all characters except NEWLINE.

Parentheses for grouping and vertical bar for alternation are also supported. The notation r{d,e} in a rule indicates between d and e instances of regular expression r. It has a higher precedence than |, but lower than that of *, ?, +, or concatenation. The ^ (caret) character at the beginning of an expression permits a successful match only immediately after a NEWLINE, and the $ character at the end of an expression requires a trailing NEWLINE.

The / character in an expression indicates trailing context; only the part of the expression up to the slash is returned in yytext, although the remainder of the expression must follow in the input stream.

An operator character may be used as an ordinary symbol if it is within double quotes (") or is preceded by ‘\’.
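The character class, repetition, grouping, and alternation operators described above also exist in POSIX extended regular expressions, so they can be tried out from C without running lex. The sketch below uses the POSIX regcomp() and regexec() functions; the helper name whole_match is made up for this example. It does not model lex itself: trailing context, the {d,e} precedence described above, and the NEWLINE behavior of "." are not covered.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Return 1 if the whole of text matches the POSIX extended regular
 * expression pattern, 0 if it does not, and -1 if the pattern cannot
 * be compiled. whole_match is a hypothetical helper, not part of lex. */
int whole_match(const char *pattern, const char *text)
{
    regex_t re;
    char *anchored;
    int rc;

    assert(pattern != NULL && text != NULL);

    /* Wrap the pattern as ^(pattern)$ so partial matches do not count.
     * The extra 5 bytes hold "^(", ")$" and the terminating NUL. */
    anchored = malloc(strlen(pattern) + 5);
    if (anchored == NULL)
        return -1;
    strcpy(anchored, "^(");
    strcat(anchored, pattern);
    strcat(anchored, ")$");

    rc = regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB);
    free(anchored);
    if (rc != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

For instance, whole_match("[abx-z]", "y") reports a match while whole_match("[abx-z]", "c") does not, and whole_match("ab+", "a") fails because + requires at least one occurrence.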

Three subroutines defined as macros are expected: input() to read a character; unput(c) to replace a character read; and output(c) to place an output character. They are defined in terms of the standard streams, but you can override them. The macros input() and output() read from and write to the files yyin and yyout, which default to stdin and stdout, respectively. For C++ code, input() is renamed lex_input(), and output() is renamed lex_output() to avoid name conflicts with iostreams.

The generated function is named yylex(), and the lex library libl.a contains a main() which calls it.

The action REJECT on the right side of a rule rejects the current match and executes the next suitable match; the function yymore() accumulates additional characters into the same yytext; and the function yyless(p) pushes back the portion of the string matched beginning at p, which should be between yytext and yytext+yyleng.
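As an illustration of overriding input() and unput(c), the following C sketch reads from an in-memory string instead of yyin. All names here (set_input_string, str_input, str_unput, PUSHBACK_MAX) are hypothetical, and returning 0 at end of input is an assumption about the convention the generated scanner expects; check the macros in your own lex.yy.c before relying on it.

```c
#include <assert.h>
#include <stddef.h>

#define PUSHBACK_MAX 16              /* hypothetical limit for this sketch */

static const char *src = "";         /* text still to be scanned */
static char pushback[PUSHBACK_MAX];  /* stack of characters given to unput */
static size_t npushed = 0;

/* Point the replacement routines at a new in-memory string. */
void set_input_string(const char *s)
{
    assert(s != NULL);
    src = s;
    npushed = 0;
}

/* Candidate replacement for input(): pushed-back characters are
 * returned first, then characters from the string.
 * Returns 0 at end of input (assumed convention). */
int str_input(void)
{
    if (npushed > 0)
        return (unsigned char)pushback[--npushed];
    if (*src == '\0')
        return 0;
    return (unsigned char)*src++;
}

/* Candidate replacement for unput(c): the next str_input() returns c. */
void str_unput(int c)
{
    assert(npushed < PUSHBACK_MAX);
    pushback[npushed++] = (char)c;
}
```

In a lex source file these would typically be hooked in from the definitions section with lines such as "#undef input" followed by "#define input() str_input()", and likewise for unput(c); whether that is sufficient depends on how the particular lex implementation defines its macros.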

In a lex program, any line beginning with a blank is assumed to contain only C text and is copied; if it precedes %% it is copied into the external definition area of the lex.yy.c file. All rules should follow a %%, as in YACC. Lines preceding %% which begin with a nonblank character define the string on the left to be the remainder of the line; it can be used later by surrounding it with {}. Note: curly brackets do not imply parentheses; only string substitution is done.

The external names generated by lex all begin with the prefix yy or YY.

Certain table sizes for the resulting finite-state machine can be set in the definitions section:

%p n
number of positions is n (default 2000)

%n n
number of states is n (default 500)

%e n
number of parse tree nodes is n (default 1000)

%a n
number of transitions is n (default 3000)

The use of one or more of the above automatically implies the −v option, unless the −n option is used.

Programs generated by lex(1) need either the −e or −w option to handle input that contains EUC characters from supplementary codesets. If neither of these options is specified, yytext is of the type char[], and the generated program can handle only ASCII characters.

When the −e option is used, yytext is of the type unsigned char[] and yyleng gives the total number of bytes in the matched string. With this option, the macros input(), unput(c), and output(c) should perform byte-based I/O in the same way as with the regular ASCII lex(1). Two more variables are available with the −e option, yywtext and yywleng, which behave the same as yytext and yyleng would under the −w option.

When the −w option is used, yytext is of the type wchar_t[] and yyleng gives the total number of characters in the matched string. If you supply your own input(), unput(c), or output(c) macros with this option, they must return or accept EUC characters in the form of wide characters (wchar_t). This allows a different interface between your program and the lex internals, to expedite some programs.

When either the −e or −w option is used, the generated C program must be linked with the wide character library libw.a using the −lw linker flag.

Pattern Matching

When either the −e or −w option is used, patterns used in rules can include characters from both primary and supplementary codesets. The generated program performs pattern matching correctly on an input stream containing EUC characters from supplementary codesets.

You may use any valid EUC characters in a character range [A-Z] as long as A and Z belong to the same codeset.

"." matches any character from any codeset (except NEWLINE).

International Caveats

Start condition names must consist solely of ASCII characters.

The "%T" directive cannot be used when either the −w or −e option is used.

The default main() found in the lex library (libl.a) does not have a setlocale(3C) call. Thus, the resulting program would not recognize non-ASCII characters correctly. You have to supply your own main() in order to have your program handle EUC characters correctly. The simplest main() would be:

#include <locale.h>
int yylex(void);
int main(void) {
    setlocale(LC_ALL, "");  /* pick up the locale from the environment */
    yylex();
    return 0;
}

OPTIONS

−c Indicates C actions and is the default.

−e Generate a program that can handle EUC characters (cannot be used with the −w option).
yytext[] is of type unsigned char[].

−n Opposite of −v; −n is the default.

−t Place the result on the standard output instead of in file lex.yy.c.

−v Print a one-line summary of statistics of the generated analyzer.

−w Generate a program that can handle EUC characters (cannot be used with the −e option).
Unlike the −e option, yytext[] is of type wchar_t[].

−V Print out version information on standard error.

−Q[y|n]
−Qy prints version information to the output file lex.yy.c. −Qn does not print version information and is the default.

EXAMPLES

The command line,

lex lexcommands

draws lex instructions from the file lexcommands, and places the output in lex.yy.c.

The following example lex program converts uppercase letters to lowercase, removes blanks at the end of lines, and replaces multiple blanks by single blanks.

%%
[A-Z]   putchar(yytext[0]+'a'-'A');
[ ]+$   ;
[ ]+    putchar(' ');
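As a cross-check on what the three rules above do, the following hand-written C function applies the same transformation to a string. It is only an illustrative sketch, not the code that lex generates. The name squeeze_text is made up for this example, and the function assumes, following the description of $ given earlier, that a run of blanks is removed only when a NEWLINE immediately follows it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy in to out, lowercasing ASCII uppercase letters, dropping runs of
 * blanks that are followed by a newline, and collapsing every other run
 * of blanks to a single blank. out must have room for strlen(in) + 1
 * bytes. Returns the length of the result. */
size_t squeeze_text(const char *in, char *out)
{
    size_t n = 0;

    assert(in != NULL && out != NULL);
    while (*in != '\0') {
        if (*in >= 'A' && *in <= 'Z') {
            /* Rule 1: [A-Z] becomes the matching lowercase letter. */
            out[n++] = (char)(*in + 'a' - 'A');
            in++;
        } else if (*in == ' ') {
            /* Rules 2 and 3: skip the whole run of blanks, then emit
             * one blank unless the run ends the line. */
            in += strspn(in, " ");
            if (*in != '\n')
                out[n++] = ' ';
        } else {
            /* Unrecognized input is copied to the output unchanged. */
            out[n++] = *in++;
        }
    }
    out[n] = '\0';
    return n;
}
```

Note that a run of blanks at the very end of the input, with no NEWLINE after it, is collapsed to one blank rather than removed in this sketch.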

INTERNATIONAL EXAMPLES

The following is a similar program for the "japanese" locale environment:

%%
[\x30001221-\x30001273]   putwchar(yytext[0]+0x0080);
[ \x300010a1]+$           ;
[ \x300010a1]+            putchar(' ');
%%
#include <locale.h>
int main(void) {
    setlocale(LC_ALL, "");
    yylex();
    return 0;
}

This program converts every hiragana character (whose EUC wide character value is between 0x30001221 and 0x30001273) to the corresponding katakana character. It also recognizes the double-width space character (0x300010a1). 0x0080 is the offset between corresponding hiragana and katakana characters when represented as wide characters. Note that the hexadecimal escape sequences in this example are not strictly needed; the corresponding EUC characters could have been used instead.

This program must be processed by lex with the −w option and linked with the wide character library libw.a using −lw. Compilation and execution must be done in an environment where either the LANG or LC_CTYPE environment variable is set to japanese. The command lines for building this program would be:

% lex -w sample.l
% cc -o sample lex.yy.c -ll -lw
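The wide character values in this example can be related to two-byte EUC codes with a little arithmetic. The C sketch below assumes a packing in which a codeset 1 character is tagged with 0x30000000 and each byte contributes its low 7 bits; this assumption is inferred from the values quoted above, not stated by this manual page, and the function names are hypothetical. Under that assumption the EUC hiragana range 0xA4A1 to 0xA4F3 maps to 0x30001221 through 0x30001273, and adding 0x0080 moves a character from row 0xA4 to the katakana row 0xA5.

```c
#include <assert.h>

/* Assumed tag for codeset 1 characters (inferred from the example). */
#define EUC_CS1_TAG 0x30000000UL

/* Pack a two-byte codeset 1 EUC character into a wide character value,
 * keeping the low 7 bits of each byte (assumed layout). */
unsigned long euc2_to_wide(unsigned char b1, unsigned char b2)
{
    assert(b1 >= 0xa1 && b1 <= 0xfe && b2 >= 0xa1 && b2 <= 0xfe);
    return EUC_CS1_TAG
         | ((unsigned long)(b1 & 0x7f) << 7)
         | (unsigned long)(b2 & 0x7f);
}

/* Mirror the first rule of the example: shift hiragana by 0x0080 and
 * leave every other value unchanged. */
unsigned long hiragana_to_katakana(unsigned long wc)
{
    if (wc >= 0x30001221UL && wc <= 0x30001273UL)
        return wc + 0x0080UL;
    return wc;
}
```

With this packing, the double-width space 0xA1A1 comes out as 0x300010a1, which agrees with the value used in the second and third rules.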

FILES

lex.yy.c default output file when −t is not specified

/usr/ccs/lib/libl.a lex library

ncform

nceucform C-program prototypes

NOTES

Ratfor is no longer supported as a host language.

The way to use hexadecimal escape sequences for multibyte characters differs from that of the versions of lex in previous releases of the SunOS Asian Language Environments, namely JLE, KLE, CLE, and HLE. In those versions, a multibyte character was written as a sequence of hexadecimal escape sequences, one per byte, rather than as one hexadecimal escape sequence representing the character’s wide character value.

SunOS 5.4 — Last change: 5 Mar 1992
