lex(1)
NAME
lex − generate programs for lexical analysis of text
SYNOPSIS
lex [-rctvn] [-Xsecondaryn...] [file] ...
DESCRIPTION
lex generates programs to be used in simple lexical analysis of text.
The input files contain strings and expressions to be searched for, and C text to be executed when strings are found. Multiple files are treated as a single file. If no files are specified, the standard input is used.
A file lex.yy.c is generated which, when loaded with the library, copies the input to the output except when a string specified in the file is found; then the corresponding program text is executed. The actual string matched is left in yytext, an external character array. Matching is done in order of the strings in the file.
The strings can contain square brackets to indicate character classes, as in [abx-z] to indicate a, b, x, y, and z; and the operators *, +, and ? mean respectively any non-negative number of, any positive number of, and either zero or one occurrences of, the previous character or character class. The character . is the class of all ASCII characters except new-line. When 16-bit support is enabled, the character . also matches all valid 16-bit characters under the current locale in addition to the ASCII characters.
Parentheses for grouping and vertical bar for alternation are also supported. The notation r{d,e} in a rule indicates between d and e instances of regular expression r. It has higher precedence than |, but lower than *, ?, +, and concatenation. The character ^ at the beginning of an expression permits a successful match only immediately after a new-line, and the character $ at the end of an expression requires a trailing new-line. The character / in an expression indicates trailing context; only the part of the expression up to the slash is returned in yytext, but the remainder of the expression must follow in the input stream. An operator character may be used as an ordinary symbol if it is enclosed between double quotes (") or preceded by \. Thus [a-zA-Z]+ matches a string of letters.
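As a brief illustration of the operators described above, the following specification is a sketch only. The patterns and the messages printed are invented for this example; they show one rule each for a character class, bounded repetition, the ^ and $ anchors, trailing context, and the two ways of quoting an operator character.

```lex
%{
/* Illustrative patterns only; the messages are invented for this sketch. */
#include <stdio.h>
%}
%%
[abx-z]         printf("one of a, b, x, y, z\n");
[0-9]{2,4}      printf("two to four digits: %s\n", yytext);
^#.*            printf("line starting with #: %s\n", yytext);
end$            printf("\"end\" at end of line\n");
ab/cd           printf("ab followed by cd; yytext is %s\n", yytext);
"a+b"           printf("literal a+b\n");
a\+c            printf("literal a+c\n");
```

Input that matches none of these rules is copied to the output unchanged, as described above.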
Three subroutines defined as macros are expected: input() to read a character; unput(c) to replace a character read; and output(c) to place an output character. They are defined in terms of the standard streams, but can be overridden. The program generated is named yylex(), and the library contains a main() which calls setlocale() then calls yylex(). Users can define their own version of main(), but if lex.yy.c is generated using -w or -m and the locale used is other than the default, the user-defined main() routine must include a call similar to setlocale (LC_ALL, yylocale) before yylex() is called, or the actions of the generated scanner will be undefined. The generated lex.yy.c program includes the appropriate declaration and initialization of yylocale. The action REJECT on the right side of the rule causes this match to be rejected and the next suitable match executed; the function yymore() accumulates additional characters into the same yytext; and the function yyless(p) pushes back the portion of the string matched beginning at p, which should be between yytext and yytext+yyleng. The macros input and output use files yyin and yyout to read from and write to, defaulted to the standard input and the standard output, respectively.
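The following sketch combines two of the points above: the REJECT action and a user-defined main(). The counters s and h are names invented for this example. The call to setlocale() assumes that yylocale is declared in the generated lex.yy.c ahead of the user subroutines section, as the text above suggests; this should be checked against the generated file.

```lex
%{
/* Sketch: count occurrences of "she" and of "he", including the "he"
 * inside "she".  The counters s and h are invented for this example. */
#include <stdio.h>
#include <locale.h>
int s = 0, h = 0;
%}
%%
she     { s++; REJECT; }
he      { h++; REJECT; }
\n      |
.       ;
%%
int main()
{
    /* Required when lex.yy.c was generated with -w or -m and the locale
     * is not the default; yylocale is assumed to be declared above. */
    setlocale(LC_ALL, yylocale);
    yylex();
    printf("she: %d, he: %d\n", s, h);
    return 0;
}
```

Without REJECT, the "he" contained in "she" would not be counted, because the longer match would consume those characters. The program must still be loaded with the lex library for the remaining support routines.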
Any line beginning with a blank is assumed to contain only C text and is copied; if it precedes %% it is copied into the external definition area of the lex.yy.c file. All rules should follow a %%, as in yacc (see yacc(1)). Lines preceding %% that begin with a non-blank character define the string on the left to be the remainder of the line. This is called a definition, and can be called out later by surrounding it with {}. Note that curly brackets do not imply parentheses; only string substitution is done.
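The substitution rule above has a practical consequence, sketched here with two definitions whose names (AB and ABP) are invented for this example. Under plain string substitution, {AB}+ is read as a|b+, that is, a single a or one or more b; {ABP}+ is read as (a|b)+, which is usually what is intended.

```lex
%{
/* Sketch: the definition names AB and ABP are invented for illustration. */
#include <stdio.h>
%}
AB      a|b
ABP     (a|b)
%%
{AB}+   printf("AB: %s\n", yytext);
{ABP}+  printf("ABP: %s\n", yytext);
```

Writing the parentheses in the definition itself, as in ABP, avoids depending on operator precedence at each place the definition is used.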
Options
lex recognizes the following options, which must appear before any files:
-r Indicates ratfor actions (see ratfor(1)).
-c Indicates C actions − this is the default.
-m Enables basic multibyte support. This option allows intermixed ASCII and 16-bit characters in a lex specification. 16-bit characters can appear in quoted and unquoted strings, regular expressions, character classes, definitions, definition names, and as endpoints of ranges. This option also enables the . character to recognize any valid 16-bit character as well as ASCII characters. With this option, meta-characters such as *, +, and ? can be applied to 16-bit characters the same way they are applied to ASCII characters.
-n Suppresses printing of the -v summary.
-t Causes the lex.yy.c program to be written instead to the standard output.
-v Provides a one-line summary of statistics for the machine generated.
-w Enables basic multibyte support and causes the underlying data type returned to the user, yytext, to be an array of type wchar_t. This option takes precedence over the -m option.
-Xsecondaryn Resets the sizes of certain internal lex tables. secondary is a single letter from the set {dDsSac} that specifies the table to be reset; n is the new size:
d Table of definitions; default=200.
D Table of characters in definition strings; default=5000.
s Table of start conditions; default=50.
S Table of characters in start condition names; default=500.
c Array table for storing character classes; default=1000.
a Right context/action array table; default=100.
If an array overflows, lex issues a fatal error message including a suggestion of which table to reset. For example:
Definitions too long, try -XD option
Certain table sizes for the resulting finite state machine can be set in the definitions section:
%p n number of positions is n (default is 2500);
%q n number of positions for one state is n (default is 300);
%n n number of states is n (default is 500);
%e n number of parse tree nodes is n (default is 1000);
%a n number of transitions is n (default is 2000);
%k n number of packed character classes is n (default is 1000);
%o n size of output array is n (default is 3000).
The use of one or more of the preceding table options automatically implies -v unless -n is specified.
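A definitions section that raises some of these limits might look like the following sketch. The sizes are arbitrary values chosen for illustration, not recommendations, and the single rule is included only to make the specification complete.

```lex
%{
/* Sketch: the table sizes below are arbitrary illustrative values. */
#include <stdio.h>
%}
%p 5000
%n 1000
%a 4000
%%
[a-z]+  printf("word %s\n", yytext);
```

Because table directives are present, lex prints the -v summary for this specification unless -n is given.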
Other recognized directives in the definitions section:
%l locale specifies the value of the LANG environment variable when the final scanner is run. locale is a quoted or unquoted string such as japanese or chinese-t. The character string yylocale is set to the value of locale at runtime, and the default main() subroutine provided in the lex library, libl, calls setlocale (LC_ALL, yylocale). locale is also used to evaluate character attributes when reading the input specification and is used for analyzing the character set when building the tables in lex.yy.c. If the value of locale indicates that the basic character size is 16 bits, the -m option is enabled automatically.
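A minimal sketch of the %l directive follows. It assumes that a locale named japanese, the name used in the text above, is installed on the system; the rule itself is invented for illustration.

```lex
%{
/* Sketch: assumes a locale named "japanese" is available. */
#include <stdio.h>
%}
%l japanese
%%
[a-z]+  printf("word %s\n", yytext);
```

With the default main() from the lex library, no further code is needed; a user-defined main() would have to call setlocale() itself, as described earlier.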
External names generated by lex all begin with the prefix yy or YY.
EXTERNAL INFLUENCES
Environment Variables
LC_CTYPE determines the size of the characters in use unless overridden by the %l locale source directive.
LC_MESSAGES determines the language in which messages are displayed.
LANG is used as a default if LC_CTYPE or LC_MESSAGES is not set.
International Code Set Support
Single- and multi-byte character code sets are supported. Multi-byte character code set support is enabled with -w or -m.
EXAMPLES
D [0-9]
%%
if printf("IF statement\n");
[a-z]+ printf("tag, value %s\n",yytext);
0{D}+ printf("octal number %s\n",yytext);
{D}+ printf("decimal number %s\n",yytext);
"++" printf("unary op\n");
"+" printf("binary op\n");
"/*" { loop:
while (input() != '*');
switch (input())
{
case '/': break;
case '*': unput('*');
default: goto loop;
}
}
WARNINGS
The -r option is not yet fully operational.
The ^ operator is not supported in character classes, [], containing multi-byte characters.
The token buffer in the program built by lex is of fixed length,
yytext[YYLMAX]
where YYLMAX is defined to be 200 unsigned characters, or 400 unsigned characters if -m has been specified and LC_CTYPE indicates a multi-byte character set. Overflow of this array is not detected in the lex.yy.c program.
SEE ALSO
LEX − Lexical Analyzer Generator in C Programming Tools.
STANDARDS CONFORMANCE
lex: SVID2, XPG2, XPG3, POSIX.2
Hewlett-Packard Company — HP-UX Release 9.0: August 1992