awk(C)                        06 January 1993                        awk(C)

Name

awk: awk, oawk, nawk - pattern scanning and processing language

Syntax

    awk [ -Fsep ] [ [-e] 'prog' ] ... [ -f progfile ] ... [ [-v] var=value ... ] [ file ... ]

Description

awk is an interpreted pattern-matching language with a wide range of applications. See the chapter on awk in the User's Guide for a complete discussion of its use. (nawk and oawk are alternative versions of awk. awk should be used in preference to nawk or oawk. See ``Notes'' below for more details.)

You can enter an awk program (prog) directly from the command line, enclosing it in single quotes to prevent interpretation by the shell. The -e flag preceding prog is optional. For longer awk programs, it may be more convenient to fetch them from a file (progfile); this is done with the -f option. You can specify multiple -e programs and -f files; they are concatenated together (with intervening newlines) to form the program that is executed. (This is like the -e and -f options in sed(C).)

Input files are read in order. If no files are given on the command line, the standard input is used.

You can change the awk field separator on the command line with the -Fsep option, where the regular expression sep is the new delimiter. You can also specify the field separator as a single character; this sets the field separator to be that character. awk -Ft is a special case that sets the field separator to a tab. (The field separator can also be changed within an awk program using the variable FS.)

You can set the value of variables you are going to use in the awk program from the command line using var=value, where var is the variable and value is its initial value. This can be preceded with an optional -v.

What awk does with your program

After awk checks the syntax of your program, it reads each record (generally, each line) of the input and attempts to match it against the patterns specified in the program.
For each pattern in the program, there may be an associated action performed when an input record matches the pattern. Actions can be made up of a single action statement, like print, or of a combination of statements. A pattern-action statement has the form:

    pattern { action }

Either pattern or action may be omitted. If there is no action with a pattern, the matching line is printed. If there is no pattern with an action, the action is performed on every input line.

Programming conventions

Pattern-action statements, and individual statements within actions, generally begin on a new line. The opening brace ({) must be on the same line as the pattern for which the actions should be performed. Multiple action statements may appear on a single line if they are separated by semicolons (;).

A newline can be hidden with a backslash (\), so you can use backslash-newline to continue a long line.

Comments in awk are introduced by a number sign (#) and end with the end of the line. Comments can appear anywhere in a line.

Blank lines and whitespace (blanks and tabs) in an awk program are ignored.

Fields, records, and built-in variables

awk presumes that each field in a record is separated by whitespace, and that each record consists of one line of input. Both of these defaults can be modified.

You can change the field separator on the command line, as discussed earlier, using the -Fsep option. You can also reset the value of the input field separator variable FS from within your awk program. FS can be set to any regular expression. The following action is a special case that resets FS to its default behavior:

    BEGIN { FS = " " }

The BEGIN in this example is a special pattern that matches before the first record is read; this is the mechanism awk provides for doing introductory processing.
Setting FS to a single blank is equivalent to:

    BEGIN { FS = "[ \t]+" }

That is, setting FS to a single blank tells awk to regard any combination of blanks and tabs (any whitespace) as a field separator. Note that once you set the input field separator to something other than a single blank (that is, to all whitespace), leading whitespace (before the first field) is no longer ignored.

awk is designed to consider each line of input as a complete record, but you can get awk to recognize multiline records by resetting the variable RS. To get awk to recognize multiline records, set RS to the null string:

    BEGIN { RS = "" }

Now, awk will presume that records are separated by one or more blank lines. When you reset RS like this to use multiline records, newline is always considered a field separator, no matter what the value of FS is. To restore the default record separator, reset RS to a newline:

    { RS = "\n" }

You can address any field in the input record using the syntax $1, $2, etc., where $1 is the first field in a record, $2 is the second field, and so on. The entire record is referred to as $0.

Fields can also be referred to in relation to the built-in field variables, for example, for a five-field record:

    $(NF - 2)

would refer to the third field. The NF in this example is a built-in variable awk provides that counts the number of fields in a current record. (Thus, $NF refers to the last field in the current record.)

The following list shows all the built-in variables in awk:

    _____________________________________________________________________
    Variable    Meaning
    _____________________________________________________________________
    ARGC        number of command-line arguments plus 1
    ARGV        array of command-line arguments (ARGV[0 ... ARGC-1])
    ENVIRON     array of environment variables, indexed by the name of
                the variable
    FILENAME    name of current input file
    FNR         input record number in current file
    FS          input field separator (default: any whitespace)
    NF          number of fields in current input record
    NR          number of records read so far
    OFMT        output format for numbers (default: "%.6g"; see printf(S))
    OFS         output field separator (default: blank)
    ORS         output record separator (default: newline)
    RS          input record separator (default: newline)
    RSTART      index of first character matched by match()
    RLENGTH     length of string matched by match()
    SUBSEP      separates multiple subscripts in array elements
                (default: ``\034'')

Patterns

Patterns can be any of the following:

    BEGIN
    END
    /expr/
    relational expression
    pattern && pattern
    pattern || pattern
    (pattern)
    !pattern
    pattern1,pattern2

BEGIN and END match before the first line is read, and after the last line has been read, respectively.

All other patterns can contain extended regular expressions, like in egrep. See grep(C) and ed(C) for the pattern-matching syntax of extended regular expressions. (In the following discussion, extended regular expressions will be referred to simply as regular expressions.)

You can create a string matching pattern using a regular expression in one of three ways:

    /regexpr/
        This will match the current record if regexpr is contained anywhere in the current record.

    expression ~ /regexpr/
        This will match if regexpr is contained anywhere in the string value of expression.

    expression !~ /regexpr/
        This will match if regexpr is not contained anywhere in the string value of expression.
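As a rough illustration of the three forms above, the following shell sketch runs one pattern of each kind over the same input. The sample data and the shell variable names (data, out1, out2, out3) are invented for this example and do not come from the manual page.

```sh
# Sample input invented for illustration: name and department.
data='Steve Engineering
Susannah Documentation
Vipin Sales'

# /regexpr/ : matches if "Doc" appears anywhere in the record.
out1=$(printf '%s\n' "$data" | awk '/Doc/')

# expression ~ /regexpr/ : matches only if field 2 starts with "S".
out2=$(printf '%s\n' "$data" | awk '$2 ~ /^S/')

# expression !~ /regexpr/ : matches if field 2 does not end in "ing".
out3=$(printf '%s\n' "$data" | awk '$2 !~ /ing$/')

printf '%s\n---\n%s\n---\n%s\n' "$out1" "$out2" "$out3"
```

Because none of these programs has an action, each matching record is simply printed.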
A relational expression is made up of two numeric or string expressions compared with one of the following operators:

    _____________________________________________________________________
    Operator    Meaning
    _____________________________________________________________________
    <           less than
    <=          less than or equal to
    >           greater than
    >=          greater than or equal to
    ==          equal to
    !=          not equal to

When strings are compared using relational operators (<, <=, >, >=), they are compared character by character using the sort order provided by the machine, which is usually the ASCII sort order. One string is less than another string if it would appear earlier than (before) the other in the sort order. When one operand in a relational expression is a string, the other operand is converted to a string as well and they are compared using the method described above.

Patterns can be joined using the logical operators && (AND) and || (OR). When patterns are joined like this, the pattern matches the current record if the entire pattern evaluates to true (nonzero or nonnull). A pattern can be negated using the ! logical NOT operator. Parentheses may be used for grouping patterns.

pattern && pattern matches a record when both the first pattern and the second pattern match the record.

pattern || pattern matches a record when either the first pattern or the second pattern matches the record.

!pattern means ``does not match pattern.'' That is, !pattern matches every record that is not matched by pattern.

pattern1, pattern2 defines a matching range. The accompanying action is performed for all records from the first occurrence of pattern1 to the following occurrence of pattern2, inclusive. (The action is performed for the lines containing pattern1 and pattern2, as well as all the lines in between.)

Actions

The actual work your awk program does occurs in the action part of the program.
Action statements can be made up of:

+  expressions (numeric and string constants, variables, array references, and so on)

+  flow control statements (branches or loops)

+  built-in arithmetic or string functions or functions you define yourself

Variables in awk are not explicitly declared; they simply spring into existence when they are first used. awk determines from the context whether a variable is numeric or string. Numeric variables are automatically initialized to 0; string variables are automatically initialized to the empty string (""). (See ``Number or string'' below, and the chapter on awk in the User's Guide for more information about variable types and type coercion in awk.)

Values are assigned to variables in the usual way in awk:

    a = 100

creates a numeric variable a with the value ``100''. You can assign several variables in a single statement:

    water = oil = "wet"

This creates two string variables, water and oil, and sets them both to contain the string ``wet''. Assignment operators are evaluated from right to left.

The following assignment operators are available; the shorthand assignment notation is borrowed from the C programming language:

    _____________________________________________________________________
    Operator    Meaning
    _____________________________________________________________________
    a=b         set a equal to b
    a+=b        set a equal to a + b
    a-=b        set a equal to a - b
    a*=b        set a equal to a * b
    a/=b        set a equal to a / b
    a%=b        set a equal to a % b; a becomes the remainder of a
                divided by b
    a^=b        set a equal to a ^ b; a becomes a raised to the power b

awk offers the usual arithmetic operators: ``+'' (add), ``-'' (subtract), ``*'' (multiply), ``/'' (divide), ``%'' (modulo; divide and give remainder), ``^'' (exponentiation; ``**'' is a synonym). The unary ``+'' (plus) and ``-'' (minus) are also available. All arithmetic in awk is done in floating point.
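The assignment and arithmetic operators described above can be sketched in a short program that runs entirely in a BEGIN action, so that no input is read. The values are arbitrary and the shell variable out is only a convenience for capturing the result.

```sh
# All values are arbitrary; nothing is read from standard input.
out=$(awk 'BEGIN {
    a = 100               # numeric variable
    a += 5                # a is now 105
    a %= 10               # remainder of 105 divided by 10, so a is 5
    a ^= 2                # 5 raised to the power 2, so a is 25
    water = oil = "wet"   # right to left: oil is set first, then water
    print a, water, oil, 7 / 2, -a + 1
}')
echo "$out"
```

The division 7 / 2 yields 3.5 rather than 3 because arithmetic is done in floating point.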
Relational expressions in action statements use the same operators as relational expressions in patterns; consult the relational operators table in ``Patterns'' above.

The logical AND and logical OR (&& and ||) are also available, as well as the logical NOT (!, as in !expr).

There is also a conditional operator, ``?:'':

    expression1 ? expression2 : expression3

expression1 is evaluated, and if it is non-empty and non-zero, then the expression has the value of expression2. Otherwise, it has the value of expression3.

Variables can be incremented using prefix or postfix notation, as in C. x++ and ++x are both equivalent to x = x + 1, and x-- and --x both are equivalent to x = x - 1. The difference between prefix (++x) and postfix (x++) is when x assumes its new value. In prefix notation, x is immediately incremented; in postfix notation, the current value of x is used and then x is incremented.

Parentheses can be used to alter the order of evaluation in arithmetic and relational expressions.

The following table of precedence shows all the available action statement operators and the order in which they are evaluated. The table is in decreasing order of precedence; operators higher in the table are evaluated before operators lower in the table.

    _____________________________________________________________________
    Operator                  Meaning
    _____________________________________________________________________
    $                         field
    ++ --                     increment, decrement (prefix and postfix)
    ^                         exponentiation (** is a synonym)
    !                         logical negation
    + -                       unary plus, unary minus
    * / %                     multiply, divide, mod
    + -                       add, subtract
    (no explicit operator)    string concatenation
    < <= > >= != ==           relationals
    ~ !~                      regular expression match, negated match
    in                        array membership
    &&                        logical AND
    ||                        logical OR
    ?:                        conditional expression
    = += -= *= /= %= ^=       assignment

All of these operators are evaluated from left to right (they are left associative), except for the assignment operators, the conditional expression operator, and exponentiation, which are evaluated from right to left (they are right associative).

Arrays

One-dimensional arrays are available in awk. Like other variables in awk, arrays and array elements do not need to be declared; they come into existence upon their first use.

awk allows you to use strings as array subscripts; arrays that do this are called associative arrays. This lets you group together data quite simply. Say we have a data file listing employee names, department names, and the number of sick days the employee has taken:

    Steve     Engineering    2
    Chris     Engineering    1
    Susannah  Documentation  0
    Vipin     Sales          2
    Connie    Marketing      3
    Matt      Documentation  1
    Nancy     Sales          1
    Nigel     Documentation  0

The first field, $1, contains the employee name; the second field, $2, contains the department, and the third field, $3, contains the number of sick days for that employee. To accumulate the number of sick days in each department:

    { sickness[$2] += $3 }

This creates the array sickness, which uses the values in the second field (``Engineering'', ``Documentation'', ``Sales'', and ``Marketing'') as its subscripts. The sick day totals in field three are then collected under the appropriate subscript.

The construct:

    for (i in arr) statement

does statement for every subscript i in the array arr. Subscripts are looped over in a random order. If the value of i is changed within statement, unpredictable results may occur.

The split function splits input into subscripts in an array.
It takes the form:

    split(string,arr,fs)

where string is the string you want to split, arr is the array into which you want to split it, and fs is the field separator on which you want to split. The first component of string is stored in arr[1], the second in arr[2] and so on. The return value is the number of fields.

Elements can be deleted from an array with the delete statement:

    delete arr[subscript]

After this is done, arr[subscript] no longer exists.

awk does not support multi-dimensional arrays, but this can be simulated by using a list of subscripts; see the User's Guide for details.

Flow of control

awk uses branching and looping statements borrowed from the C programming language. In all the following constructs, a single statement can be replaced by a statement list enclosed in { braces }. Each statement in a statement list should begin on a new line or after a semicolon.

The following constructs are available:

    if (expression) statement1 else statement2
        If expression is non-zero and non-empty, do statement1; otherwise, do statement2. The ``else statement2'' is optional. If there are several ifs together with an else, the else belongs with the nearest preceding if.

    while (expression) statement
        While expression is non-zero and non-empty, statement is executed.

    for (expression1; expression2; expression3) statement
        This is a generalized form of the while statement. The for statement is the same as:

            expression1
            while (expression2) {
                statement
                expression3
            }

        All three expressions are optional. This is often used to go through a loop based on the value of a counter, where expression1 is used to initialize a counter; expression2 is the test; and expression3 increments the counter. While expression2 is non-empty and non-zero, statement is executed.

    do statement while (expression)
        statement is repeatedly executed until expression becomes null or zero.

The break, continue, and next statements can be used to break out of loops that would otherwise keep going.
    break
        drops out of the innermost while, for, or do loop.

    continue
        causes the next iteration of the loop to begin. Execution will go to the test expression in a while or do loop, and to expression3 in a for loop.

    next
        reads the next record and starts the main input loop again.

    exit
        will go straight to the END statements, if there are any. If exit occurs in an END statement, the program itself exits. If a numeric expression is given after exit, this expression is taken as the exit status for the awk program.

Output

The print and printf statements are used to write output in awk.

    print expr1,expr2, ...,exprn

will print the string value of each expression separated by the output field separator, followed by the output record separator. Without the commas, the expressions are concatenated. print by itself is an abbreviation for print $0. To print an empty line use:

    print ""

The printf function in awk is like printf(S) in C:

    printf format, expr1, expr2, ... , exprn

format can be made up of regular characters, which are printed as-is, escaped special characters, such as Tab (\t) or Newline (\n), and format keyletters that specify how to print the expressions following the format. Format keyletters begin with a ``%'' and can be preceded with a width specification, a precision statement, and/or an instruction to left-justify an expression in its field. The first expression replaces the first formatting keyletter, and so on.

If a print or printf statement includes an expression with the greater-than operator (>), this expression should be enclosed in parentheses to avoid confusion between the greater-than operator and redirection into a file.
For example:

    { print $0 $2 > $3 }

This statement says ``print the record and then field 2 into a file named by field 3,'' while:

    { print $0 ($2 > $3) }

says ``print the record, followed by a 1 if field 2 is greater than field 3, or a 0 if it is not.''

printf keyletters are:

    _____________________________________________________________________
    Keyletter   Prints expr as
    _____________________________________________________________________
    %c          the ASCII character referred to by the least significant
                8 bits of the numeric value of expr; truncates expr to
                the nearest integer
    %d          a decimal integer; truncates expr to the nearest integer
    %e          scientific notation using the form [-]d.ddddddE[+-]dd
    %f          floating point notation using the form [-]ddd.dddddd
    %g          the shorter of e or f conversion, with nonsignificant
                zeros suppressed
    %o          an unsigned octal number
    %s          a string
    %x          unsigned hexadecimal number
    %%          prints a ``%'', no argument is converted

The following escape sequences are recognized within regular expressions and strings:

    _____________________________________________________________________
    Escape sequence   Meaning
    _____________________________________________________________________
    \b                Backspace
    \f                Formfeed
    \n                Newline
    \r                Carriage return
    \t                Tab
    \ddd              octal value ddd

Output can be redirected into files using:

    > filename

and

    >> filename

Files are opened only once using the redirection operator. The first form will overwrite whatever is in filename, if filename already exists, and will create filename if it does not exist. The second form will append output to filename.

To send output to a pipe, use:

    | command-line

where command-line is the command line to which you want to send the output.

Filenames and command lines can be expressions, variables, or literal filenames or command lines. If you want to use a literal filename or command line, you must enclose it in double quotes; otherwise, awk will treat it as a variable.
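A minimal sketch of printf keyletters combined with output to a pipe follows. The three input records are borrowed from the employee data used earlier in this page; the choice of sort as the receiving command and the shell variable out are assumptions made for this example.

```sh
out=$(printf '%s\n' 'Vipin Sales 2' 'Connie Marketing 3' 'Chris Engineering 1' |
awk '{
    # %-10s left-justifies the name in a 10-character field;
    # %2d prints the count as an integer at least 2 characters wide.
    # The literal command line "sort" is in double quotes so that awk
    # does not treat it as a variable.
    printf "%-10s %2d\n", $1, $3 | "sort"
}')
echo "$out"
```

The pipe is opened once, on the first record, and every later printf writes to the same sort process; the sorted lines appear when awk finishes and the pipe is closed.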
There is a limit to how many files and pipes you can open in an awk program (see ``Limits'' below). Use the close statement to close files or pipes:

    close(filename)
    close(command-line)

where filename or command-line is the open file or pipe.

Input

awk provides the getline function to read in successive lines of input from a file or a pipe.

    getline
        getline by itself takes the next record of input as $0 and sets NF, NR, and FNR.

    getline <file
        The next record from file becomes $0; NF is set.

    getline var
        The next record of input is placed in var; NR and FNR are set.

    getline var <file
        The next record in file is placed in var.

    command | getline
        The output of command is piped to getline. $0 and NF are set.

    command | getline var
        The output of command is piped to getline and stored in var.

All forms of getline return 1 for successful input, 0 for end of file, and -1 for an error.

To read input from a file until the file runs out, use:

    while ( ( getline x < file ) > 0) { ... }

The ``> 0'' is needed so that the test catches a -1 error returned from getline. Otherwise, the while loop would read -1 as true, since it is non-zero.

Functions

The following arithmetic functions are built into awk:

    __________________________________________________________
    Function      Returns
    __________________________________________________________
    atan2(y,x)    arctangent of y/x in the range -pi to pi
    cos(x)        cosine of x, with x in radians
    exp(x)        exponential function of x, e^x
    int(x)        integer part of x; truncated toward 0 when x > 0
    log(x)        natural (base e) logarithm of x
    rand()        random number r, where 0 <= r < 1
    sin(x)        sine of x, with x in radians
    sqrt(x)       square root of x
    srand()       set the seed for rand() from the time of day
    srand(x)      x is new seed for rand()

The string functions are:

    gsub(r,s,t)
        globally substitutes the string s for the regular expression r in the string t. If t is omitted, substitutions are made in the current record ($0). The number of substitutions is returned.
    index(s,t)
        returns the position in string s where string t first occurs, or 0 if it does not occur at all.

    length(s)
        returns the length of its argument taken as a string, or of the whole record if there is no argument.

    match(s,re)
        returns the position in string s where the regular expression re occurs, or 0 if it does not occur at all. RSTART is set to the starting position (which is the same as the returned value), and RLENGTH is set to the length of the matched string.

    split(s,a,fs)
        splits the string s into array elements a[1], a[2], ..., a[n], and returns n. The separation is done with the regular expression fs or with the field separator FS if fs is not given.

    sprintf(format, expr, expr, ... )
        formats the expressions according to the printf format and returns the resulting string.

    sub(r,s,t)
        substitutes the string s in place of the first instance of the regular expression r in string t and returns the number of substitutions. If t is omitted, awk substitutes in the current record ($0).

    substr(s,p)
        returns the suffix of s starting at position p.

    substr(s,p,n)
        returns the n-character substring of s that begins at position p.

    toupper(s)
        returns a copy of the string s with lowercase letters converted to uppercase.

    tolower(s)
        returns a copy of the string s with uppercase letters converted to lowercase.

awk provides the system function for running commands:

    system(command-line)
        executes command-line and returns its exit status.

You can define your own functions in awk. The syntax for this is:

    function name(parameter-list) {
        statements
    }

name is the name of the function; parameter-list is a comma-separated list of variable names which, within the function, refer to the arguments with which the function was called; and statements are the action statements that make up the body of the function. Function definitions can appear anywhere a pattern-action statement can appear.
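The function syntax above might be used as in the following sketch. The function name max, its parameters a and b, and the two input records are hypothetical; max is not a built-in awk function.

```sh
out=$(printf '%s\n' '3 9' '12 4' | awk '
# max is a hypothetical helper defined for this sketch.
function max(a, b) {
    return (a > b) ? a : b
}
# Adding 0 forces each field to a number before the call,
# so the comparison inside max is numeric.
{ print max($1 + 0, $2 + 0) }')
echo "$out"
```

The definition sits at the same level as the pattern-action statement that calls it, which is one of the places a function definition is allowed.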
Recursion is permitted within user-defined functions; that is, a function may call itself directly or indirectly.

Variables passed to functions (as arguments) are copied and a copy of the variable is manipulated by the function; that is, these variables are passed by value. The exception to this in awk is arrays, which are passed by reference; that is, the actual array elements are manipulated by the function, so array elements can be permanently altered, created, or deleted within a function.

Missing function arguments are set to null; extra arguments are ignored.

To define a return value for your function, you must include a statement:

    return expression

where expression is the value you want your function to return. expression here is optional; if you leave it out, control will be returned to the caller of the function, but the return value will be undefined. The return statement itself is optional as well.

The formal parameters of a function (the argument list) are local to that function, but any other variables are global. You can use the argument list as a way of creating variables local only to the function; like other variables in awk these will be automatically initialized with null values.

Number or string?

In awk, variables come into being when they are used; there is no declaration of a variable, and, therefore, you do not declare the type of a variable as a string or a number. Instead, awk assumes the type of a variable from its context.

In an assignment statement, such as v=e, the type of v becomes the type of e. When the context is ambiguous, awk determines the types when the program runs.

In comparisons, if both operands are numeric, they are compared as numbers; otherwise, they are compared as strings. (A string is greater than another string if it comes later in the sort sequence, and less than another string if it comes earlier in the sort sequence.)
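The difference between numeric and string comparison can be seen with constants, as in this small sketch. The variable names n and s are chosen only for this example.

```sh
out=$(awk 'BEGIN {
    # Both operands numeric: compared as numbers, so 10 < 9 is false (0).
    n = (10 < 9)
    # Both operands strings: compared character by character, and "1"
    # sorts before "9", so "10" < "9" is true (1).
    s = ("10" < "9")
    print n, s
}')
echo "$out"
```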
All field variables are of type string; in addition, each field can be considered to have a numeric value (that is, the numeric value of a string). The numeric value of a string is the value of the longest prefix of a string that looks numeric. For example, if a field contains the string ``123abc'', the numeric value of this would be 123.

The value of a variable in awk is initially 0 or the string "".

You can force a variable of one type to become another type; this is known as type coercion. To force a number to a string:

    number ""

(Concatenate the null string to number.) To force a string to a number:

    string + 0

For more information about variable types, see the chapter on awk in the User's Guide.

Limits

The following limits exist in this implementation of awk:

    100 fields
    3000 characters per input record
    3000 characters per output record
    3000 characters per field
    3000 characters per printf string
    400 characters per literal string or regular expression
    250 characters per character class
    55 open files or pipes
    double precision floating point

Numbers are limited to what can be represented on your machine; numbers outside this range will have string values only.

Examples

The following examples are all individual awk programs; to try them out, you will need to put them in a file and call the file with awk -f, or enclose them in single quotes on the awk command line.
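Both ways of running a program can be sketched from a POSIX shell as follows. The file name prog.awk, the temporary directory, the use of mktemp, and the input line are assumptions made for this illustration.

```sh
# prog.awk is a hypothetical file name chosen for this sketch.
dir=$(mktemp -d)
printf '%s\n' '{ print $2, $1 }' > "$dir/prog.awk"

# Run the program from a file with -f ...
from_file=$(echo 'hello world' | awk -f "$dir/prog.awk")

# ... or pass the same program in single quotes on the command line.
inline=$(echo 'hello world' | awk '{ print $2, $1 }')

echo "$from_file"
echo "$inline"
rm -r "$dir"
```

The single quotes keep the shell from expanding $2 and $1 before awk sees them.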
Print lines longer than 72 characters:

    length > 72

Print only the first two fields in opposite order:

    { print $2, $1 }

Same, with input fields separated by comma and/or blanks and tabs:

    BEGIN { FS = ",[ \t]*|[ \t]+" }
          { print $2, $1 }

Add up the first column, print sum and average:

        { s += $1 }
    END { if ( NR > 0 ) print "sum is", s, " average is", s/NR }

Print fields in reverse order (on separate lines):

    { for (i = NF; i > 0; --i) print $i }

Print all lines between start/stop pairs:

    /start/, /stop/

Print all lines whose first field is different from previous one:

    $1 != prev { print; prev = $1 }

Simulate echo(C):

    BEGIN {
        for (i = 1; i < ARGC; i++)
            printf "%s ", ARGV[i]
        printf "\n"
        exit
    }

Simple env(C):

    BEGIN {
        for (e in ENVIRON)
            print e "=" ENVIRON[e]
    }

See also

ed(C), grep(C), lex(CP), printf(S) and sed(C).

``Simple programming with awk'' in the User's Guide

Alfred V. Aho, Brian W. Kernighan, and Peter J. Weinberger, The AWK Programming Language, Addison-Wesley, 1988.

Notes

Input whitespace is not preserved on output if fields are involved.

func is an obsolete synonym for function.

This version of awk is the so-called ``new awk'' described in The AWK Programming Language (referenced above). It is mostly compatible with an older version of awk still in common use. On some systems, the ``new awk'' is called nawk, the older one is oawk, and awk may be linked to either version. The nawk and oawk names do not exist on all systems, and even when they do exist, are not reliable. Only the name awk should be used.

Known incompatibilities between this version of awk and older awks include:

+  The definition of ``what constitutes a number'' is slightly different. In the old awk, a string had a numeric value only if the entire string looked numeric. In the new awk, a string has a numeric value if a prefix of the string looks numeric, and the numeric value is the value of the longest such prefix.
   For example, the string:

       123foo

   does not have a numeric value in the old awk (and is treated as 0), but has the value 123 in the new awk.

+  Assigning to a nonexistent field in the new awk changes $0 to include that field, whereas, in the old awk, $0 did not change. Thus, the program:

       { $2 = $1; print }

   produces different output if the input has only one field.

+  The new awk allows user-defined functions; these are not recognized in the old awk.

+  There are several new reserved words in the new awk which could be used as variable names in the old awk.

+  In addition, the parsing has changed, which may result in some ambiguous-looking expressions that were legal in the old awk failing with the new awk. For example, in regular expressions, the character class:

       [/]

   is not legal in the new awk, but was in the old. The equivalent character class for the new awk is:

       [\/]

   However, this character class, when used with the old awk, is not equivalent to the original expression.

Standards conformance

awk is conformant with:

AT&T SVID Issue 2;
and X/Open Portability Guide, Issue 3, 1989.