A friendly introduction to the Awk command
Awk is a powerful command-line tool for manipulating text files in Unix-like operating systems. It allows you to search for patterns in a file and perform actions on the lines that match those patterns. Awk is a versatile tool that is widely used for a variety of tasks, such as data extraction, data manipulation, and report generation.
In this article, I will provide a friendly introduction to the Awk command and explore its features through examples.
Basic Syntax
Awk has a simple syntax. The basic structure of an Awk command is as follows:
awk 'pattern { action }' file
Here, the pattern specifies which lines to match in the file, and the action is the command to perform on the matched lines. The file is the input file to be processed.
For example, let’s consider the following text file (example.txt):
apple 5
banana 10
cherry 15
orange 20
Suppose we want to print the first column of this file. We can use the following Awk command:
$ awk '{ print $1 }' example.txt
apple
banana
cherry
orange
Here, the pattern is empty, meaning the action will be applied to all lines in the file. The action is to print the first column (separated by whitespace) of each line.
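Either half of a rule can be left out. The following is a minimal sketch, assuming the example.txt shown above (the snippet recreates it so it runs on its own); the regular expression /an/ and the shell variable names are illustrative choices, not part of the original example.

```shell
# Recreate the article's example.txt so the snippet is self-contained.
printf '%s\n' 'apple 5' 'banana 10' 'cherry 15' 'orange 20' > example.txt

# Pattern only: with no action, awk prints every matching line in full.
pattern_only=$(awk '/an/' example.txt)

# Action only: with no pattern, the action runs on every line.
action_only=$(awk '{ print $2, $1 }' example.txt)

printf '%s\n' "$pattern_only"   # banana 10 and orange 20
printf '%s\n' "$action_only"    # the two columns swapped on every line
```

When the action is missing, awk behaves as if you had written { print $0 }.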
Pattern in depth
As I said, the pattern in awk specifies which lines to match. It can be one of the following:
Regular expression: In this case, awk runs the action on every line that matches the regular expression. For example:
$ awk '/1/ {print $0}' example.txt
banana 10
cherry 15
Here awk matches any line that contains “1”, and the action prints the whole line.
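Regular expressions can also be anchored to the start or end of the line. This is a small sketch on the same example.txt (recreated here); the two patterns are only illustrations.

```shell
# Recreate the article's example.txt so the snippet is self-contained.
printf '%s\n' 'apple 5' 'banana 10' 'cherry 15' 'orange 20' > example.txt

# ^ anchors the match at the start of the line.
starts_with_b=$(awk '/^b/ { print $0 }' example.txt)

# $ anchors the match at the end of the line.
ends_with_5=$(awk '/5$/ { print $0 }' example.txt)

printf '%s\n' "$starts_with_b"   # banana 10
printf '%s\n' "$ends_with_5"     # apple 5 and cherry 15
```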
Logical expression: Patterns can also be specified using logical expressions, which can include comparisons, Boolean operators, and regular expressions. Here are some examples:
$ awk '$2 > 10 { print $1 }' example.txt
cherry
orange
We select every line whose second column is greater than 10 and print its first column. You can use any of the standard comparison operators: ==, !=, >, >=, <, <=. You can also test whether a value matches a regular expression with ~, or does not match it with !~.
$ awk '$1 ~ /e$/ { print $1 }' example.txt
apple
orange
Here we select rows where the first column matches the regex /e$/, that is, rows whose first column ends with the letter ‘e’.
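The negated match and plain string comparison work the same way. A short sketch, again on a recreated example.txt, with variable names chosen only for illustration:

```shell
# Recreate the article's example.txt so the snippet is self-contained.
printf '%s\n' 'apple 5' 'banana 10' 'cherry 15' 'orange 20' > example.txt

# !~ selects rows whose first column does NOT end with the letter e.
not_ending_in_e=$(awk '$1 !~ /e$/ { print $1 }' example.txt)

# == compares the whole field against a string literal.
cherry_count=$(awk '$1 == "cherry" { print $2 }' example.txt)

printf '%s\n' "$not_ending_in_e"   # banana and cherry
printf '%s\n' "$cherry_count"      # 15
```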
Logical expressions can also be combined with logical operators:
- &&: logical AND operator
- ||: logical OR operator
- !: logical NOT operator
$ awk '$2 >= 10 && $2 <= 20 { print $0 }' example.txt
banana 10
cherry 15
orange 20
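You may also see two patterns separated by a comma, as in $2 >= 10, $2 <= 20. This is not shorthand for &&. It is a range pattern: awk selects every line from one that matches the first pattern through the next line that matches the second. On example.txt it happens to produce the same output as the command above, because every line that starts a range also ends it.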
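Note that a comma between two patterns forms a range pattern in awk, which is different from &&. The sketch below uses a pair of conditions, chosen purely for illustration, on which the two forms disagree.

```shell
# Recreate the article's example.txt so the snippet is self-contained.
printf '%s\n' 'apple 5' 'banana 10' 'cherry 15' 'orange 20' > example.txt

# Range pattern: start at the line where $2 == 10 and keep printing
# until (and including) the line where $2 == 20.
range=$(awk '$2 == 10, $2 == 20 { print $0 }' example.txt)

# Logical AND: both conditions must hold on the same line, which
# is impossible here, so nothing is printed.
both=$(awk '$2 == 10 && $2 == 20 { print $0 }' example.txt)

printf 'range:\n%s\n' "$range"   # banana 10, cherry 15, orange 20
printf 'and:\n%s\n' "$both"      # empty
```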
Finally, you can of course use parentheses to write more complex expressions or increase readability.
$ awk '(!/r/) || (/^o/) { print }' example.txt
apple 5
banana 10
orange 20
This selects lines that either don’t contain the letter ‘r’ or start with the letter ‘o’.
Variables in Awk
Awk provides several built-in variables that can be used to manipulate the data. Some of the commonly used variables are:
- $0: the entire line
- $1, $2, $3, etc.: the first, second, third, etc. fields (columns) of the line
- NR: the number of the current record (i.e., the line number)
- NF: the number of fields (columns) in the current record
- FS: the field separator (default is whitespace)
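To see a few of these variables in action, here is a small sketch on example.txt (recreated in the snippet). The output layout is just one possible choice.

```shell
# Recreate the article's example.txt so the snippet is self-contained.
printf '%s\n' 'apple 5' 'banana 10' 'cherry 15' 'orange 20' > example.txt

# NR is the current line number, NF the number of fields on that line.
numbered=$(awk '{ print NR ": " $0 " (" NF " fields)" }' example.txt)

# $NF is the last field, whatever the number of fields happens to be.
last_field=$(awk '{ print $NF }' example.txt)

printf '%s\n' "$numbered"     # first line: 1: apple 5 (2 fields)
printf '%s\n' "$last_field"   # 5, 10, 15, 20
```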
We can also define our own variables with the = operator.
Let’s consider an example where we want to calculate the total of the second column in the example.txt file. We can use the following Awk command:
awk '{ sum += $2 } END { print sum }' example.txt
Here, we are using a variable called sum to accumulate the values of the second column. The END keyword specifies that the action should be performed after all the lines have been processed.
You can see that you don’t have to declare a variable before you use it for the first time. Also pay attention to the += operator: sum += $2 is shorthand for sum = sum + $2.
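As a small extension of the same idea, the sketch below computes both the total and the average of the second column. Dividing by NR in the END block is my own addition and relies on NR still holding the number of lines read at that point.

```shell
# Recreate the article's example.txt so the snippet is self-contained.
printf '%s\n' 'apple 5' 'banana 10' 'cherry 15' 'orange 20' > example.txt

# Total of the second column.
total=$(awk '{ sum += $2 } END { print sum }' example.txt)

# Average: in the END block NR holds the number of lines read.
# The NR > 0 guard avoids a division by zero on empty input.
average=$(awk '{ sum += $2 } END { if (NR > 0) print sum / NR }' example.txt)

printf 'total=%s average=%s\n' "$total" "$average"   # total=50 average=12.5
```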
Awk has two data types: strings and numbers. However, as the example above shows, whenever you use a string in a numerical operation, awk tries to convert it to a number.
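The conversion rules can be sketched with a few literals. A string that does not start with a number counts as 0, and a leading numeric prefix is used when there is one; the sample strings are arbitrary.

```shell
# No input file is needed: everything happens in the BEGIN block.
converted=$(awk 'BEGIN {
    print "10" + 5       # numeric string: 10 + 5
    print "3x" + 1       # leading numeric prefix 3 is used
    print "abc" + 1      # no numeric prefix, so treated as 0
    print 10 " apples"   # number converted to a string by concatenation
}')

printf '%s\n' "$converted"   # 15, 4, 1, 10 apples
```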
Similarly to the END section, you can declare a BEGIN section. It runs before the first line is processed and is useful for printing custom headers or initializing variables. In the next example, we will print the names and professions of people aged 30 or older from this CSV file (people.csv).
Name,Age,City,Occupation
John,30,New York,Software Engineer
Alice,25,San Francisco,Data Analyst
Michael,42,Chicago,Marketing Manager
Julia,36,Los Angeles,Product Manager
To do that, we have to change the field separator from the default whitespace to a comma.
$ awk 'BEGIN {FS=","}; NR > 1 && $2 >= 30 {print "Name: "$1", Profession: "$4}' people.csv
Name: John, Profession: Software Engineer
Name: Michael, Profession: Marketing Manager
Name: Julia, Profession: Product Manager
BEGIN {FS=","} changes the field separator to a comma. NR > 1 && $2 >= 30 skips the first line (the header of the CSV) and checks whether the person is old enough. {print "Name: "$1", Profession: "$4} prints the line in the required format.
You can see that the $1, $2, ... variables now hold the correct values from the CSV file. This is thanks to the FS variable. To change the separator, we can also use awk -F ',' '...' instead of the BEGIN section.
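The two ways of setting the separator can be compared side by side. This sketch recreates people.csv from the listing above and prints only the names; the variable names are illustrative.

```shell
# Recreate the article's people.csv so the snippet is self-contained.
cat > people.csv <<'EOF'
Name,Age,City,Occupation
John,30,New York,Software Engineer
Alice,25,San Francisco,Data Analyst
Michael,42,Chicago,Marketing Manager
Julia,36,Los Angeles,Product Manager
EOF

# Separator set inside the program, in a BEGIN block.
with_begin=$(awk 'BEGIN { FS = "," } NR > 1 && $2 >= 30 { print $1 }' people.csv)

# Separator set on the command line with -F.
with_flag=$(awk -F ',' 'NR > 1 && $2 >= 30 { print $1 }' people.csv)

printf '%s\n' "$with_begin"   # John, Michael, Julia
printf '%s\n' "$with_flag"    # same three names
```

Keep in mind that splitting on a plain comma does not understand quoted CSV fields that themselves contain commas, so this approach only suits simple files like this one.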
Action in depth
The action section of an awk command is the code that gets executed for each line of input that matches the pattern. It is enclosed in curly braces {} and can contain one or more statements separated by semicolons (;) or newlines.
The action section can perform a wide range of operations on the input data, including printing, conditional statements, arithmetic and string manipulation, and more.
As we saw, one of the most commonly used commands in the action section is print, which writes the selected fields or variables to standard output.
Another useful construct is the if else statement, which allows for the conditional execution of code based on a specified condition. For example, the following command prints a remark about each person’s age.
$ awk -F ',' 'NR > 1 {
    if ($2 < 30) {
        print $1 " is young"
    } else {
        print $1 " is still young, but a bit older"
    }
}' people.csv
John is still young, but a bit older
Alice is young
Michael is still young, but a bit older
Julia is still young, but a bit older
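An if else statement combines naturally with user-defined variables and an END block. The following sketch counts the two age groups instead of printing a line per person; the counter names young and older are my own.

```shell
# Recreate the article's people.csv so the snippet is self-contained.
cat > people.csv <<'EOF'
Name,Age,City,Occupation
John,30,New York,Software Engineer
Alice,25,San Francisco,Data Analyst
Michael,42,Chicago,Marketing Manager
Julia,36,Los Angeles,Product Manager
EOF

# Count people under 30 and people aged 30 or older, skipping the header.
counts=$(awk -F ',' 'NR > 1 {
    if ($2 < 30) {
        young++
    } else {
        older++
    }
}
END {
    # Adding 0 forces a numeric 0 if a counter was never incremented.
    print "under 30: " (young + 0)
    print "30 or older: " (older + 0)
}' people.csv)

printf '%s\n' "$counts"   # under 30: 1, then 30 or older: 3
```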
The action section can also perform arithmetic operations on fields or variables, using operators such as +, -, *, and /. We saw that in the example with sum.
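The four operators can be tried directly on a field. This is a quick sketch on example.txt (recreated below), with arbitrary operands.

```shell
# Recreate the article's example.txt so the snippet is self-contained.
printf '%s\n' 'apple 5' 'banana 10' 'cherry 15' 'orange 20' > example.txt

# Apply +, -, * and / to the second column of every line.
arithmetic=$(awk '{ print $1, $2 + 1, $2 - 1, $2 * 2, $2 / 2 }' example.txt)

printf '%s\n' "$arithmetic"   # first line: apple 6 4 10 2.5
```

Division is floating-point, so 5 / 2 yields 2.5 rather than 2.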
Another useful function is substr, which extracts a portion of a string given a start position (counted from 1) and an optional length. For example, the following command extracts the first three characters of the first field of each line in the CSV file:
awk -F ',' '{ print substr($1, 1, 3) }' people.csv
Nam
Joh
Ali
Mic
Jul
A somewhat artificial example, but you get the idea.
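Because the third argument of substr is a length rather than an end position, and can be omitted, a second sketch may help. It runs on example.txt (recreated below) and also uses the built-in length function.

```shell
# Recreate the article's example.txt so the snippet is self-contained.
printf '%s\n' 'apple 5' 'banana 10' 'cherry 15' 'orange 20' > example.txt

# substr($1, 2, 3): three characters starting at position 2.
# substr($1, 4):    everything from position 4 to the end.
# length($1):       number of characters in the field.
pieces=$(awk '{ print substr($1, 2, 3), substr($1, 4), length($1) }' example.txt)

printf '%s\n' "$pieces"   # first line: ppl le 5
```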
In addition to these commands, the action section of the awk command supports a wide range of other functions and operators for manipulating data. By combining pattern matching and action commands, you can perform powerful data processing and analysis tasks with ease.
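To tie the pieces together, here is a minimal sketch of a tiny report that combines BEGIN, a pattern with an action, a user-defined variable, and END. The header text and the threshold of 10 are arbitrary choices for illustration.

```shell
# Recreate the article's example.txt so the snippet is self-contained.
printf '%s\n' 'apple 5' 'banana 10' 'cherry 15' 'orange 20' > example.txt

report=$(awk '
BEGIN    { print "Fruit Count" }             # header, printed once
$2 >= 10 { print $1, $2; total += $2 }       # matching lines only
END      { print "Total", total }            # summary, printed once
' example.txt)

printf '%s\n' "$report"   # header, three fruit lines, then Total 45
```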
Conclusion
In this article, I provided a friendly introduction to the Awk command and explored its features through examples. Awk is a powerful tool for manipulating text files and provides a rich set of features that can be used to perform complex data processing tasks. With some practice, you can become proficient in using Awk to perform various text-processing tasks efficiently.