
AWK                                                                    AWK


NAME

        awk - pattern scanning and processing language


SYNOPSIS

        awk [-ffile] [-Fstr] [-t] [-l] [program] [var=text] [file] ...


DESCRIPTION

Awk scans each input file for lines that match any of a set of patterns 
specified in the program.  With each pattern in the program there can be an 
associated action that will be performed when a line of a file matches the 
pattern.

The AWK program may be specified as a file with the -f option as:

        -ffilename.ext
        -f filename.ext

in which case the AWK program is read from the named file.  If the file does 
not exist then an error message will be printed. 

The AWK program may also be specified as a single argument as:

        filename.ext
        filename[.awk]

or as a valid AWK program:

        { for (i in ARGV) printf ("%d: %s\n", i, ARGV[i]) }

AWK will first try to open the first argument as a file.  If it can't open a 
file, it adds the extension ".awk" and tries again.  Finally, AWK will attempt 
to read the argument directly as an AWK program. 

If the filename is a minus sign (-) then the AWK program is read from the 
standard input.  The program may then be terminated with either a ctrl-Z or
a period (.) on a line by itself.  The second method is useful for entering 
an AWK program followed by the data for the program.  If no program file is 
specified then the program is read from standard input.

If the -f option is used, the full path, name, and extension must be 
specified.  If only the filename is given (without -f), AWK will first attempt 
to open the named file, then the file with the extension ".AWK", and finally 
will attempt to parse the argument as a program.  Multiple -f options may be 
used to get the program source from many files.

Files are read in order; the file name '-' means standard input.  Each line 
is matched against the pattern portion of every pattern-action statement; 
the associated action is performed for each matched pattern.

If a file name has the form variable=value, program variables may be changed 
before a file is read.  The assignment takes place when the argument would be 
treated as the next file to read.   Any assignments before the first file
take place before the first BEGIN block is executed.  An assignment after 
the last file will occur before any END block unless an exit was performed.

        awk "{ print code, NR, $0 }" code=10 file1 code=75 file2

If no files are specified the input is read from standard input.

An input line is made up of fields separated by the field separator FS.  
The fields are denoted by $1, $2 ...; $0 denotes the entire line:

        $0 = "now is the time"

        $1 = "now"      $2 = "is"
        $3 = "the"      $4 = "time"

with the default FS (white space).  If the field separator is set to comma (,)
with "-F," on the command line then the fields might be:

        $0 = "a, b,c,, ,"

        $1 = "a"        $2 = " b"       $3 = "c"
        $4 = ""         $5 = " "        $6 = ""
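
As a minimal sketch (not one of the original examples), the following program 
prints every field of a record on its own line, bracketed so that empty and 
blank fields are visible.  Run with "-F," on the line shown above, it should 
list the same six fields:

        { for (i = 1; i <= NF; i++) printf("%d: [%s]\n", i, $i) }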

A pattern-action statement has the form:

        pattern { action }

A missing { action } has the same effect as { print $0 }; a missing pattern 
always matches. 

Pattern-action statements are separated by semicolons or newlines.  A 
statement may be continued on the next line by putting a backslash (\) at the 
end of the line. 

        { words += NF }; END { print words }

A pattern is a test that is performed on each input line.  If the pattern 
matches the input line then the corresponding action is performed.

Patterns come in several forms:

Form            Example         Meaning

BEGIN           BEGIN {N=1}     initialize N before input is read
END             END {print N}   print N after all input is read
function        function x(y)   define a function called x
text match      /stop/          line contains the string "stop"
expression      $1 == 3         first field is the number 3
compound        /x/ && NF > 2   more than two fields and contains "x"
range           NR==10,NR==20   records ten through twenty inclusive

BEGIN and END patterns are special patterns that match before any files are 
read and after all files have been read, respectively.  There may be multiple 
occurrences of these patterns, and the associated actions are executed in the 
order in which they occur.  

If there is only a series of BEGIN blocks in the awk program and no other 
pattern/action blocks except function declarations then no input files are 
read.  If only END blocks are defined then all the files are read and NR will 
be set to the number of records in all the files.

        BEGIN { page = 5 }
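
Two minimal sketches of the rules above, given as separate programs.  The 
first has only a BEGIN block, so it should finish without reading any input.  
The second has only an END block, so all input is read before the record 
count is printed:

        BEGIN { print "no input is read" }

        END { print NR, "records" }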

A function pattern is never matched and serves to declare a user defined 
function.  You can declare a function with more parameters than are 
passed as arguments so that the extra parameters can act as local 
variables.  

        function show(a, i) { for (i in a) print a[i] }
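
A slightly longer sketch of the same idiom.  The function name max and its 
parameter names are illustrative only.  The caller passes just the array and 
its size, so i and m act as local variables.  Adding 0 forces the comparison 
to be numeric:

        function max(a, n, i, m) {
            m = a[1] + 0
            for (i = 2; i <= n; i++) if (a[i] + 0 > m) m = a[i] + 0
            return m
        }
        { n = split($0, f); print max(f, n) }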

A regular expression by itself is matched against the input record ($0).  That 
is, "/abc/" is equivalent to "$0 ~ /abc/". 

An expression matches if it evaluates to a nonzero number or a non-null 
string.  Any logical combination of expressions and regular expressions may 
also be used as a pattern.

        FILENAME != oldname && FILENAME != "skip"

The last special pattern is two patterns separated by a comma.  This pattern 
specifies a range of records that match the pattern.  The pattern starts to 
match when the first pattern matches and stops matching when the second 
pattern matches.  If they both match on the same input record then only that 
record will match the pattern.

        /AUTHOR/,/NOTES/
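
A range may also carry an action.  In this sketch the first statement prints 
records 3 through 5 with their record numbers.  In the second, both patterns 
match the same record, so only record 7 is selected:

        NR==3,NR==5 { print NR ": " $0 }
        NR==7,NR==7 { print "only:", $0 }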

An action is a sequence of statements that are performed when a pattern 
matches.  

A statement can be one of the following: 

        { STATEMENT_LIST }
        EXPRESSION
        print EXPRESSION_LIST
        printf FORMAT, EXPRESSION_LIST
        if ( EXPRESSION ) STATEMENT [ else STATEMENT ]
        for ( VARIABLE in ARRAY ) STATEMENT
        for ( EXPRESSION; EXPRESSION; EXPRESSION ) STATEMENT
        while ( EXPRESSION ) STATEMENT
        do STATEMENT while ( EXPRESSION )
        break
        continue
        next
        delete ARRAY[SUBSCRIPT]
        exit [ EXPRESSION ]
        return [ EXPRESSION ]

A STATEMENT_LIST is a list of statements separated by newlines or semicolons.
As with pattern-action statements, a statement may be extended over more than 
one line with a backslash (\). 

        {
            print "value:", i, \
                  "number:", j
            i = i + $3; j++
        }

Expressions take on string or numeric values depending on the operators.
There is only one string operator, concatenation, indicated by adjacent 
expressions.  The following are the operators in order of increasing 
precedence:

Operation           Operator      Example     Meaning

assignment          = *= /= %=    x += 2      two is added to x
                    += -= ^=
conditional         ?:            x?y:z       if x then y else z
logical OR          ||            x||y        if (x) 1 else if (y) 1 else 0
logical AND         &&            x&&y        if (x) if (y) 1 else 0 else 0
array membership    in            x in y      if (exists(y[x])) 1 else 0
matching            ~ !~          $1~/x/      if ($1 contains x) 1 else 0
relational          == != >       x==y        if (x equals y) 1 else 0
                    <= >= <
concatenation                     "x" "y"     a new string "xy"
add, subtract       +  -          x+y         sum of x and y
mul, div, mod       * / %         x*y         product of x and y
unary plus minus    + -           -x          negative of x
logical not         !             !x          if (x is 0 or null) 1 else 0
exponentiation      ^             x^y         x to the yth power
inc, dec            ++ --         x++         x then add 1 to x
field               $             $3          the 3rd field
grouping            ()            ($1)++      increment the 1st field

Variables may be scalars, array elements (denoted x[i]) or fields (denoted 
$expression).  Variable names begin with a letter or underscore and may 
contain any number of letters, digits, or underscores.

Variables are initialized to both zero and the null string.  Fields and the 
command line arguments will be both string and numeric if they can be 
completely represented as numbers.  The range for numbers is 1E-306..1E306.

Array subscripts may be any string.  Multi dimensional arrays are simulated in 
AWK by concatenating the individual indexes with the subscript separator 
between them.  So array[1,1] is equivalent to array[1 SUBSEP 1].   Individual
array elements may be removed with the delete statement, and the whole array
erased with an assignment to the bare variable.

        delete a[i]             # delete one element
        a = ""                  # delete all elements
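
A sketch of the subscript-separator convention, using a hypothetical array 
named count.  Pairs of fields are tallied under a two-part subscript, and 
split recovers the two parts at the end.  The order in which the subscripts 
are visited should not be relied on:

        { count[$1,$2]++ }
        END {
            for (k in count) {
                split(k, part, SUBSEP)
                print part[1], part[2], count[k]
            }
        }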

Simply referencing an array element will cause it to be created and 
initialized.  To avoid creating unwanted elements use the in operator.

        if (i in a) print a[i]  # print one element (if it exists)
        for (i in a) print a[i] # print all elements (that exist)

Comparison will be numeric if both operands are numeric otherwise a string 
comparison will be made.  Operands will be coerced to strings if necessary.  
Uninitialized variables will compare as numeric if the other operand is 
numeric or uninitialized.  For example, 2 > "10" (a string comparison) and 
2 < 10 (a numeric comparison) are both true.
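
These rules can be checked with a short sketch.  Constants are used so that 
the outcome does not depend on how input is typed.  The comments give the 
results expected from the rules above:

        BEGIN {
            print (2 < 10)          # numeric comparison, prints 1
            print (2 > "10")        # string comparison, prints 1
            print (x == 0)          # x is uninitialized, prints 1
        }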

The built-in variables are:

Variable        Meaning                                         Default

ARGC            number of command line arguments                   -
ARGV            array of command line arguments                    -
FILENAME        name of current input file                         -
FNR             record number in current file                      -
FS              controls the input field separator                " "
NF              number of fields in current record                 -
NR              number of records read so far                      -
OFMT            output format for numbers                        "%.6g"
OFS             output field separator                            " "
ORS             output record separator                           "\n"
RLENGTH         length of string matched by match function         -
RS              controls input record separator                   "\n"
RSTART          start of string match by match function            -
SUBSEP          subscript separator                              "\034"


ARGC and ARGV are the count and values of the command line arguments. ARGV[0] 
is the full path/name of AWK.EXE, and the rest are all the command line 
arguments except the "-F", "-f" and program arguments which are used by AWK.

The field separator is a string that is interpreted as a regular expression.
A single space has a special meaning: it is changed to /[ \t]+/, and any 
leading spaces or tabs are removed.  A BEGIN action may be used to set the 
separator, or it may be set with the -F command line option. 

        BEGIN { FS = "," }      sets FS to a single comma
        "-F[ ]"                 sets FS to a single space

The record separator is a string that is either a newline or the null string.
If the record separator RS is set to the null string then multi line records 
may be read.  In this case the record separator is an empty line. Setting RS 
to "\n" will restore the default behavior.
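
A minimal sketch of multi-line records, assuming the input consists of blocks 
of lines separated by empty lines and that FS may be set to a newline in this 
version.  Each line of a block then becomes one field:

        BEGIN { RS = ""; FS = "\n" }
        { print NR ": " $1 " (" NF " lines)" }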

There are a number of built in functions:

Function            Value returned

atan2(y, x)         arctangent of y/x           in the range -pi to pi
cos(x)              cosine of x                 x in radians
exp(x)              exponentiation of x         (e ^ x)
gsub(r, s)          number of substitutions     substitute s for all r in $0
gsub(r, s, t)       number of substitutions     substitute s for all r in t     
index(s)            position of s in $0         0 if not in $0
index(s, t)         position of t in s          0 if not in s
int(x)              integer part of x
length(s)           number of characters in s
log(x)              natural log of x
match(s, r)         position of r in s or 0     sets RSTART and RLENGTH
rand()              random number               0 <= rand < 1
sin(x)              sine of x                   x in radians
split(s, a)         number of fields            split s into a on FS 
split(s, a, fs)     number of fields            split s into a on fs
sprintf(f, e, ...)  formatted string
sqrt(x)             square root of x
sub(r, s)           number of substitutions     substitute s for one r in $0
sub(r, s, t)        number of substitutions     substitute s for one r in t
substr(s, p)        substring of s from p to end
substr(s, p, n)     substring of s from p of length n
system(s)           exit status                 execute command s

The numeric procedure srand(x) sets a new seed for the random number 
generator.  srand() sets the seed from the system time.

The regular expression arguments of sub, gsub, and match may be either regular 
expressions delimited by slashes or any expression.  The expression is coerced 
to a string and the resulting string is converted into a regular expression.  
This coercion and conversion occur every time the function is called, so the 
regular expression form will always be faster. 
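
A sketch of the three functions applied to a fixed string.  The variable 
names are illustrative, and the comments give the results expected from the 
descriptions above:

        BEGIN {
            s = "the cat sat on the mat"
            n = gsub(/at/, "og", s)     # n is 3, s is "the cog sog on the mog"
            sub(/the/, "a", s)          # only the first "the" is replaced
            if (match(s, /o[a-z]+/)) print RSTART, RLENGTH     # prints 4 2
            print n, s                  # prints 3 a cog sog on the mog
        }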

The print and printf statements come in several forms:

Form                            Meaning

print                           print $0 on standard output
print expression, ...           prints expressions separated by OFS
print(expression, ...)  
printf format, expression, ...
printf(format, expression, ...)
print >"file"                   print $0 on file "file"
print >>"file"                  append $0 to file "file"
printf(format, ...) >"file"
printf(format, ...) >>"file"

close("file")                   close the file "file"

The print statement prints its arguments on the standard output, or the 
specified file, separated by the current output field separator, and 
terminated by the output record separator.  The printf statement formats its 
expression-list according to the format.  The file is only opened once unless 
it is closed between executions of the print statement.  A file that is open 
for output must be closed if it is to be used for input.  The "file" argument 
may be any expression that evaluates to a DOS file name. 
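
A sketch of output redirection, assuming the first field of each record is 
usable as a DOS file name; the ".txt" extension is an arbitrary choice.  Each 
record is appended to a file named after its first field, and the file is 
closed at once so that the number of open files stays within the limit:

        { out = $1 ".txt"; print >>out; close(out) }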

There is one function that is used for input.  It has several forms:

Form                            Meaning

getline                         read the next record into $0
getline s                       read the next record into s
getline <"file"                 read a record from file "file" into $0
getline s <"file"               read a record from file "file" into s

getline returns -1 if there is an error (such as non existent file), 0 on 
end of file and 1 otherwise.  The pipe form mentioned in the book is not 
implemented in this version.
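
A sketch of the file form, using a hypothetical file name "names.txt".  
Comparing the result with 0, rather than using it as a bare condition, keeps 
the loop from running forever when getline returns -1:

        BEGIN {
            while ((getline line <"names.txt") > 0) n++
            close("names.txt")
            print n + 0, "records"
        }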

The for ( i in a ) statement assigns to i the indexes of a for all elements 
in a.  The while (), do while (), and for (;;) statements are as in C, as are 
break and continue. 

The next statement stops processing the pattern-action statements and reads 
in the next record.  An exit will cause the END actions to be performed or if 
encountered in an END action will cause termination of the program.  The 
optional expression is returned as the exit status unless overridden by a 
further exit statement in an END action. 
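
A sketch combining the two statements; the marker strings are illustrative.  
Comment lines are skipped with next, and the program stops at the first record 
whose first field is "quit".  The END action still runs, and the exit status 
should be 2:

        /^#/            { next }
        $1 == "quit"    { exit 2 }
                        { n++ }
        END             { print n + 0, "records processed" }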

The return statement may be used only in function declarations.  It may have 
an optional value to return as the value of the function.  The value of a 
function defaults to zero/null (0/""). 

REGULAR EXPRESSIONS

A \ followed by a single character matches that character.

The ^ matches the beginning of the string.

The $ matches the end of the string.

A . matches any character.

A single character with no special meaning matches that character.

A string enclosed in brackets [] matches any single character in that string.  
Ranges of ASCII character codes may be abbreviated as 'a-z0-9'.  A right 
bracket ] may occur only as the first character of the string.  A literal - 
must be placed where it can't be mistaken as a range indicator. If the first 
character is the caret ^ then any character not in the string will match. 

A regular expression followed by * matches a sequence of 0 or more
matches of the regular expression.

A regular expression followed by + matches a sequence of 1 or more
matches of the regular expression.

A regular expression followed by ? matches a sequence of 0 or 1
matches of the regular expression.

Two adjacent (concatenated) regular expressions match a match of the first 
followed by a match of the second. 

Two regular expressions separated by | match either a match for the
first or a match for the second.

A regular expression enclosed in parentheses matches a match for the
regular expression.

The order of precedence of operators at the same parenthesis level is 
[] then *+? then concatenation then |.
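
A few illustrative patterns built from the rules above (sketches, not taken 
from the original text):

        /^[A-Za-z_][A-Za-z0-9_]*$/      whole line is an identifier
        /^[-+]?[0-9]+$/                 whole line is an optionally signed integer
        /(cat|dog)s?/                   cat, cats, dog, or dogs anywhere in line
        /[]x]/                          a right bracket or an x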


PRINTF FORMAT

Any character except % and \ is printed as that character.

A \ followed by up to three octal digits is the ASCII character
represented by that number.

A \ followed by n, t, r, b, f, v, or p is newline, tab, return, backspace, 
form feed, vertical tab, or escape. 

%[-][number][.number][l][c|d|E|e|F|f|G|g|o|s|X|x|%] prints an expression:

The optional leading - means left justified in the field
The optional first number is the field width
The optional . and second number is the precision
The optional l denotes a long expression
The final character denotes the form of the expression

        c character
        d decimal
        e exponential floating point
        f fixed, or exponential floating point
        g decimal, fixed, or exponential floating point
        o octal
        s string
        x hexadecimal 

An upper case E, F, or G denotes use of upper case E in exponential format.
An upper case X denotes hexadecimal in upper case.
Two percent characters (%%) will print as one.

A format will match the regular expression:

        /[^%]*(%(%|(-?([0-9]+)?(\.[0-9]+)?l?[cdEeFfGgosXx]))[^%]*)*/
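
A sketch of a few conversions.  The values are arbitrary, and the comments 
show the output expected from the rules above:

        BEGIN {
            printf("%5d|%-5d|\n", 42, 42)               # "   42|42   |"
            printf("%.2f %x %o\n", 3.14159, 255, 8)     # "3.14 ff 10"
            printf("%s is %d%%\n", "load", 75)          # "load is 75%"
        }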

EXAMPLES

Print lines longer than 72 characters (missing action is print):

        length($0) > 72

Print first two fields in opposite order (missing pattern is always match):

        { print $2, $1 }

Add up first column, print sum and average:

                { s = s + $1 }
        END     { print "sum is", s, "average is", s/NR }

Print fields in reverse order:

        { for (i = NF; i > 0; --i ) print $i }

Print all lines between start/stop pairs:

        /start/,/stop/

Print all lines whose first field is different from previous one:

        $1 != prev { print; prev = $1 }

Convert date from MM/DD/YY to metric (YYMMDD):

        { n = split(date, a, "/"); date = a[3] a[1] a[2] }

Copy a C program and insert include files:

        $1 == "#include" && $2 ~ /^"/ {
                include = $2;
                gsub(/"/, "", include);
                while ((getline <include) > 0) print
                next
        }
        { print }

AUTHOR

        Rob Duff,  Vancouver,  B.C.,  V5N 1Y9
        BBS: (604)877-7752  Fido: 1:153/713.0

DATE

        08-Feb-90

SEE ALSO

M. E. Lesk and E. Schmidt,
        LEX - Lexical Analyser Generator

A. V. Aho, B. W. Kernighan, P. J. Weinberger,
        Awk - a pattern scanning and processing language

A. V. Aho, B. W. Kernighan, P. J. Weinberger,
        The AWK Programming Language 
        Addison-Wesley 1988   ISBN 0-201-07981-X


NOTES

There are no explicit conversions between numbers and strings.  To force an 
expression to be treated as a number add 0 to it; to force it to be a string 
concatenate "" to it.  Array indices are strings and may have the same 
numerical value but will index different values (e.g. "01" vs "1").
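
A sketch of both points; the variable names are arbitrary, and the comments 
give the results expected from the rules above:

        BEGIN {
            a["01"] = "x"; a["1"] = "y"     # two distinct elements
            n = 0; for (k in a) n++
            print n                         # prints 2
            s = "10"; print (s + 0 > 9)     # numeric comparison, prints 1
            t = 10;  print (t "" > "9")     # string comparison, prints 0
        }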

LIMITS

        stack depth is 500
        number of files is 10
        largest string is 4000
        input line size is 2000
        number of variables is 100
        function call depth is 100
        highest field number is 100

