Getting AWK

AWK is a programming language created in the seventies by Alfred Aho, Peter Weinberger and Brian Kernighan (hence its name AWK).

Although it is Turing complete, it was designed to be efficient at one specific task: text processing. That is, some text goes in, transformations happen, and other text goes out. That’s why most AWK programs are one-liners that parse the output of other UNIX commands.

On most UNIX systems, AWK is already installed, so there is no setup involved. You can start writing programs right off the bat.

What follows is a short tutorial on AWK and its main features. It’s mostly about basic stuff but also covers some somewhat “advanced” topics, like user defined functions, sorting, and grouping.

The goal of this post is to provide a solid introduction to AWK, explore its main features, and learn under which circumstances it is a good tool for the job. Hope you enjoy it.

Text processing with AWK

Suppose you have a file named companies.txt that contains information about companies (name and email, in this case) and you want to extract the list of emails from it.

Company    Email
Foo Inc. foo@example.com
Bar Corp. bar@example.com
Baz baz@example.com

An AWK program to do that would look something like this:

NR > 1 { print $NF }

Yes, you read it right. Just one line of code. No need to open files, read lines, close handles, or anything like that. AWK will do that for you. You just have to tell it how to process each record (a line in this case).

Just for illustration purposes, let’s take a look at what the same program would look like in Ruby:

# Skip the header and print the last field of each record.
File.readlines("companies.txt").each_with_index do |line, idx|
  next if idx == 0 # Skip the header.
  puts line.split.last
end

Although Ruby’s syntax is succinct and right to the point, it still takes several lines to do what AWK does in one.

Before going on, let’s run the program to see if it produces the expected result.

Save the code above to a file named extract_emails.awk and run this command. (Note for non-UNIX users: the dollar sign you will see at the beginning of commands is the shell’s prompt; you don’t have to type it.)

$ cat ./companies.txt | awk -f ./extract_emails.awk

Since AWK programs run “over” input streams (some form of text), you must provide one. That’s what the shell is doing in the command above when it “pipes” the output of the cat command to awk. It basically says: “awk, please use whatever cat returns as your input stream.”

The end result should be something like this:

foo@example.com
bar@example.com
baz@example.com
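
The pipe is not the only way to feed AWK. As a small sketch (it recreates companies.txt and extract_emails.awk from the listings above so that it runs on its own), awk also accepts file names after the program and opens them itself:

```shell
# Recreate the sample input file from the article.
cat > companies.txt <<'EOF'
Company    Email
Foo Inc. foo@example.com
Bar Corp. bar@example.com
Baz baz@example.com
EOF

# Recreate the one-line AWK program.
cat > extract_emails.awk <<'EOF'
NR > 1 { print $NF }
EOF

# No cat, no pipe: awk reads companies.txt directly.
awk -f extract_emails.awk companies.txt
```

Both forms should print the same three email addresses; which one you use is mostly a matter of taste.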

You can also run “ephemeral” AWK programs from your terminal. (Where ephemeral means: programs that are not saved to disk.) Let’s try that.

(Important notice: Enclose your program in single quotes. Otherwise the shell will try to interpret characters such as $ and > itself, and the program is not going to work.)

$ cat ./companies.txt | awk 'NR > 1 { print $NF }'
foo@example.com
bar@example.com
baz@example.com

As you can see, the result is exactly the same.

Structure of AWK programs

An AWK program is a sequence of patterns and actions that runs against an input stream. When the current record matches a pattern, the corresponding action gets executed. You can have as many patterns as you want; AWK will execute each and every matching action. (Not just the first one that matches, as happens with other tools such as web routing libraries.)

In pseudo code, that will be:

separator = ' '
while ((record = read_record()) != EOF) {
    fields = record.split(separator)
    // test rule 1
    if (match(pattern1, fields)) { /* run action 1 */ }
    // test rule 2
    if (match(pattern2, fields)) { /* run action 2 */ }
    // test rule 3
    if (match(pattern3, fields)) { /* run action 3 */ }
    // rule N...
    // ...
}

In AWK terms, each condition in the code above would be a pattern, and the code block that goes inside the if statement would be an action.
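
To see the “every matching action runs” behavior for real, here is a minimal sketch. The three input lines are made up for the example; the third one matches both patterns, so both actions fire for it:

```shell
printf 'foo\nbar\nfoobar\n' | awk '
/foo/ { print NR ": matches foo" }
/bar/ { print NR ": matches bar" }
'
# prints:
# 1: matches foo
# 2: matches bar
# 3: matches foo
# 3: matches bar
```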

To get fields out of records, AWK splits the content of a record using a field separator (FS). By default, the field separator is a single space, which AWK treats as “any run of spaces or tabs”, but you can change it to be almost whatever you want.

To access the fields from the current record, you have to use the “dollar index” variables that AWK defines for you. $1 points to the first field, $2 to the second, and so on… (AWK’s indices are 1 based.)

In pseudo code, that will be:

separator = ' '
record = "foo bar baz"
fields = record.split(separator)
$1 = fields[0]
$2 = fields[1]
$3 = fields[2]

For instance, if you run the following program, it will print “bar”.

echo "foo bar baz" | awk '$2 { print $2 }'

AWK has a lot of built-in variables that provide information about the input stream and the runtime environment. For instance, the number of fields in the current record, field separators, record numbers, and so on…
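
A quick way to get a feel for these variables is to print a few of them for every record. This sketch uses two made-up records and prints the record number (NR), the number of fields (NF), the first field and the last field:

```shell
# The second record uses several spaces between its two fields.
printf 'foo bar baz\none   two\n' |
  awk '{ print NR, NF, $1, $NF }'
# prints:
# 1 3 foo baz
# 2 2 one two
```

Note that the extra spaces in the second record do not produce empty fields.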

Let’s say you have to process a file that contains column headers and you want to skip them. You can do that using the builtin variable NR (Record Number) by telling AWK to print all records except the first one.

NR > 1 { print $NF }

An interesting bit of the program that processes the companies’ information is the use of $NF. The built-in variable NF holds the number of fields in the current record, so $NF points to the last field, which in this input stream is the one that contains the company’s email address.

But why use $NF instead of $2?

Well, that is because in this case the number of fields per record is variable. Yeah, you read that right. Although the input file may look like a couple of records pulled out from a database, the field separator that the program uses by default is a space. So when a company has a composite name, like “Foo Inc.”, AWK splits that name into two different fields, and to access the email address for that particular company you have to use $3 as opposed to the expected $2.

Let’s take a look at these records:

$1  $2   $3 | $NF
Foo Inc. foo@example.com
$1  $2 | $NF
Baz baz@example.com

Since in this case you know for sure that the field you want to print is the last one, and that field can’t contain spaces, it doesn’t matter how many fields the current record has. By using $NF you will always print the right value.

Another cool thing about AWK is that it is pretty forgiving when it comes to undefined fields. If you ask for a field that is not there, it simply returns “”. No errors, no crashing, nothing like that. (Which is nice, because you don’t have to add null checks all over the place.)
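
Here is a tiny sketch of that forgiving behavior. The record below has only two fields, and the program asks for the fifth one; the square brackets are there just to make the empty string visible:

```shell
echo "foo bar" | awk '{ print "[" $5 "]", NF }'
# prints:
# [] 2
```

Reading a missing field does not change NF either; the record still has two fields.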

Using regex in AWK patterns

Let’s say that now you only want to print the email from companies whose names start with the letter “B”. This is a bit more complex, but since AWK patterns can be regular expressions, it is still a one-liner:

/^B/ { print $NF }

Run the program above and you should get:

bar@example.com
baz@example.com
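
A bare /regex/ pattern is tested against the whole record. If you want to test a single field instead, AWK has the ~ (match) operator. The following sketch feeds the same sample data through a pipe and asks for records whose first field starts with “B”; for this input the result happens to be the same as above:

```shell
printf 'Company    Email\nFoo Inc. foo@example.com\nBar Corp. bar@example.com\nBaz baz@example.com\n' |
  awk '$1 ~ /^B/ { print $NF }'
# prints:
# bar@example.com
# baz@example.com
```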

About field separators (FS)

By default, AWK uses spaces as field separators. I guess that is because spaces are what most UNIX programs use to format their output. (You can run ps aux or ls -s to see what I mean.)

But using spaces as the field separator is not always a good idea. In the case of the companies’ information, you will be better off using tabs or semicolons. So, let’s try that instead.

Save the next listing to a file named “companies.csv” and run the program again.

Company Name;Email
Foo Inc.;foo@example.com
Bar Corp.;bar@example.com
Baz;baz@example.com

Extract emails:

cat ./companies.csv | awk -f extract_emails.awk
# --------------^ (remember, csv.)

As you can see, the program is broken: it prints pieces of the company name along with the email. But no worries, that is an easy fix. You just have to tell AWK to use semicolons as the field separator and your program will work again.

To do that, you will use the BEGIN section. BEGIN is a special section that lets you write code that needs to run before AWK starts processing records.

The only thing that you need to do this time is to change the value of FS to “;”. (The rest of the program stays the same.)

BEGIN { FS = ";" }
NR > 1 { print $NF }

Run the program again and this time you should get the right results.
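
For one-liners typed at the terminal, there is also a shortcut: the -F command line option sets the field separator without a BEGIN section. A minimal sketch, with the companies.csv content passed through a pipe so the snippet runs on its own:

```shell
printf 'Company Name;Email\nFoo Inc.;foo@example.com\nBar Corp.;bar@example.com\nBaz;baz@example.com\n' |
  awk -F';' 'NR > 1 { print $NF }'
# prints:
# foo@example.com
# bar@example.com
# baz@example.com
```

Setting FS in BEGIN keeps the separator inside the program file, while -F is handy when the program lives on the command line.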

Adding a bit of structure

Suppose you have a larger list of companies and want to add a bit of structure to the program’s output. Let’s say, sorting by company name.

To do that you are going to use a new file called “fake_companies.csv” that contains this listing.

Company Name;Email
Acme Corporation;acme@example.com
Globex Corporation;globex@example.com
Soylent Corp;soylent@example.com
Initech;initech@example.com
Bluth Company;bluth@example.com
Umbrella Corporation;umbrella@example.com
Hooli;hooli@example.com
Vehement Capital Partners;vehement@example.com
Massive Dynamic;massive@example.com
Wonka Industries;wonka@example.com
Stark Industries;stark@example.com
Gekko & Co;gekko@example.com
Wayne Enterprises;wayne@example.com
Bubba Gump;bubba@example.com
Cyberdyne Systems;cyberdyne@example.com
Genco Pura Olive Oil Company;genco@example.com
The New York Inquirer;tnyi@example.com
Duff Beer;duff@example.com
Olivia Pope & Associates;olivia@example.com
Sterling Cooper;sterling@example.com
Ollivander's Wand Shop;ollivander@example.com
Cheers;cheers@example.com
Krusty Krab;krusty@example.com
Good Burger;goodb@example.com

The program you are about to see uses the END section. END is a special section like BEGIN, but it runs when AWK is done processing records.

In the next program I also introduce “user defined functions”. That is, functions that you can create to enhance AWK’s built-in functionality.

Create a file named “extract_and_sort.awk” and paste this code into it.
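
Before the full program, here is a bare-bones sketch of the order in which the sections run. The three input lines are made up; the rule in the middle has no pattern, so it runs for every record:

```shell
printf 'a\nb\nc\n' | awk '
BEGIN { print "start" }
      { print "record " NR ": " $0 }
END   { print "done, " NR " records" }
'
# prints:
# start
# record 1: a
# record 2: b
# record 3: c
# done, 3 records
```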

(Since there is a lot going on in this piece of code, I’ve added some comments to it, to make it easy to follow.)

BEGIN  { FS = ";" }
NR > 1 {
    # This time, instead of printing the email address to
    # the console, the program stores it in a hash-like
    # data structure to print it later.
    # (Note that there is no need to define *emails*. AWK will do
    # that for us the first time we use that variable.)

    # $1 == company name.
    # $2 == email address.
    emails[$1] = $2;
}
END {
    # This section is responsible for sorting and printing records.
    # 1. Get the unsorted company names (a.k.a. keys) and
    #    count them in *n*.
    for (name in emails)
        companies[++n] = name;

    # 2. Sort companies using the user defined function
    #    *isort*. (See the bottom of the program.)
    isort(companies, n);

    # 3. Print each company's information sorted by name.
    for (i = 1; i <= n; ++i) {
        name = companies[i]
        printf("%s %s\n", name, emails[name])
    }
}

# (Since standard AWK doesn't have a sort function, we have to roll our own.)
# Insertion sort of arr[1..n].
function isort(arr, n) {
    for (i = 2; i <= n; ++i) {
        for (j = i; j > 1 && arr[j-1] > arr[j]; --j) {
            # swap
            tmp = arr[j-1];
            arr[j-1] = arr[j];
            arr[j] = tmp;
        }
    }
}

Now pipe fake_companies.csv to awk and use the new version of the program to get a sorted list of companies and their email addresses.

$ cat fake_companies.csv | awk -f extract_and_sort.awk
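
As an aside, and not what the program above does: when all you need is sorted output, a common alternative is to let AWK do the extraction and hand the sorting to the system’s sort command. A minimal sketch with three made-up records from the listing (LC_ALL=C is set only to make the ordering predictable across machines):

```shell
printf 'Company Name;Email\nInitech;initech@example.com\nAcme Corporation;acme@example.com\nHooli;hooli@example.com\n' |
  awk -F';' 'NR > 1 { printf("%s %s\n", $1, $2) }' |
  LC_ALL=C sort
# prints:
# Acme Corporation acme@example.com
# Hooli hooli@example.com
# Initech initech@example.com
```

Sorting inside AWK, as the program above does, is still useful when the END section needs the sorted data for further processing.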

Just one more and you will be ready to go

To finish this introduction to AWK, we are going to add one more feature to the program: “grouping by letter”.

There’s a lot going on in this program, so take a close look at the code and its comments.

# Sort companies by name and print their name and email,
# grouped by the first letter of the name.
# The input file format is
# Company Name;Email
BEGIN { FS = ";" }
NR > 1 {
    emails[$1] = $2;
}
END {
    # Get the unsorted company names (a.k.a. keys) and count them in *n*.
    for (name in emails)
        companies[++n] = name;

    # Sort keys.
    isort(companies, n);

    # Print companies sorted by name.
    last_seen = ""
    for (i = 1; i <= n; ++i) {
        name = companies[i]
        if (begin_group()) {
            print_line(80)
            last_seen = name
        }
        printf("%s %s\n", name, emails[name])
    }
    print_line(80)
}

function begin_group() {
    # Note that since variables are global, you don't have to pass
    # *last_seen* and *name* as arguments to this function.
    # (Unless you need function scoped variables, you
    # don't need to pass arguments.)
    return (substr(last_seen, 1, 1) != substr(name, 1, 1));
}

# What's going on with the *i* parameter?
# In AWK all variables are global, and since this function
# is called from a loop that also uses the *i* variable, you
# need to add *i* to the parameter list to create a local
# (function scoped) version of it.
function print_line(width, i) {
    for (i = 1; i <= width; ++i) {
        printf("%s", "-");
    }
    print ""
}

# Insertion sort of arr[1..n].
function isort(arr, n) {
    for (i = 2; i <= n; ++i) {
        for (j = i; j > 1 && arr[j-1] > arr[j]; --j) {
            # swap
            tmp = arr[j-1];
            arr[j-1] = arr[j];
            arr[j] = tmp;
        }
    }
}

Run the program again and you should get the sorted list of companies and their email addresses, now separated into groups by the first letter of the company name.

$ cat fake_companies.csv | awk -f extract_and_sort.awk
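
If the extra-parameter trick used by print_line looks odd, this standalone sketch isolates it. The function names leaky and safe are made up for the example. Both functions assign 99 to i, but only the one that lists i as a parameter leaves the caller’s i alone:

```shell
awk '
function leaky()  { i = 99 }   # assigns the global i
function safe(i)  { i = 99 }   # i is a parameter, so it is local here
BEGIN {
    i = 1; safe();  print "after safe: " i
    i = 1; leaky(); print "after leaky: " i
}'
# prints:
# after safe: 1
# after leaky: 99
```

Calling safe() without arguments is fine: parameters that are not passed simply start out empty.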

And that’s it for this introduction to AWK. Down below you will find a list of commonly used builtin variables, a couple of useful one liners and additional resources to learn more about AWK.

Built-in variables

Here is a list of commonly used built-in variables.

ARGC      Number of command line arguments.
ARGV      Array of command line arguments.
FILENAME  Name of the current input file.
$0        Current input record.
FS        Input field separator (default ' ').
RS        Input record separator (default '\n').
NR        Number of input records read so far.
NF        Number of fields in the current record.
OFS       Output field separator (default ' ').
ORS       Output record separator (default '\n').
OFMT      Output format for numbers (default "%.6g").

To learn more about variables, command line arguments, functions, and more, you can check the man pages by running man awk.
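
Here is a short sketch that puts a few of these variables to work at once. The two records are taken from the companies listing, and the " -> " output separator is an arbitrary choice made for the example:

```shell
printf 'Foo Inc.;foo@example.com\nBaz;baz@example.com\n' |
  awk 'BEGIN { FS = ";"; OFS = " -> " } { print NR, NF, $1, $2 }'
# prints:
# 1 -> 2 -> Foo Inc. -> foo@example.com
# 2 -> 2 -> Baz -> baz@example.com
```

OFS is inserted wherever print arguments are separated by commas.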

A couple of one-liners

Although you could write full-blown programs in AWK, most of the time you are going to work with one-liners. Here are some interesting ones that you can use if you are going to work on source code files:

Print the total number of lines that reference “foo”:

/foo/ { count = count + 1 }
# Adding 0 prints 0 instead of an empty line when nothing matched.
END { print count + 0 }

Print lines whose length is greater than 80 characters:

length($0) > 80

Print the total number of lines:

END { print NR }
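
To try these from the terminal, wrap each program in single quotes and pass a file name. A small sketch, assuming a made-up three-line file called sample.txt and a threshold of 5 characters instead of 80 so that the tiny input triggers it:

```shell
# Create a small, made-up input file.
printf 'foo\nbar\nfoo bar\n' > sample.txt

# Number of lines that reference "foo".
awk '/foo/ { count = count + 1 } END { print count }' sample.txt
# prints: 2

# Lines longer than 5 characters.
awk 'length($0) > 5' sample.txt
# prints: foo bar

# Total number of lines.
awk 'END { print NR }' sample.txt
# prints: 3
```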

Summary/Recap

  • AWK rocks at text processing.
  • AWK splits streams into records by using new lines (but you can change that).
  • AWK splits records into fields by using spaces (but you can change that, too).
  • When a record matches a pattern, AWK triggers the action associated with that pattern.
  • All variables are global.
  • No need to define variables.
  • Variables can be locally scoped to a function by adding them to the function’s parameter list.
  • No null checks
  • One-liners most of the time
  • Programs can be loaded from disk or typed at the terminal.

Want to know more about AWK?

Who’d be better to write a book about a programming language than its creators themselves? The AWK Programming Language is the go-to book for serious AWK developers.

And of course, as with most UNIX tools, the man pages are also a great source of knowledge.

Thanks for reading! And see you next time on getting programming tools.