Streamlining Data Manipulation with the Power of Command-Line Tools

Shlok Nangia
SFU Professional Computer Science
10 min read · Feb 10, 2023

Authors: Linsong, Shlok Nangia, Jialiang Guo (3_datamen)

This blog is written and maintained by students in the Master of Science in Professional Computer Science Program at Simon Fraser University as part of their course credit. To learn more about this unique program, please visit sfu.ca/computing/mpcs.

Did you know you can clean and manipulate data in a document without even opening it? Yes, you read that right! No need for any text editor.

Not only that, but most data cleaning and manipulation tasks can be performed with command-line tools in just one line. Just think of the time you can save: no need to open the document in a text editor or Excel and rely on the limited functionality they give you.


So, if you are a geek like us who loves working in the terminal, or would like to learn how to do all your data cleaning, extraction, and aggregation operations there (so that you can look like a hacker 😉), this article is for you.

Here we are going to look at some powerful command-line tools, such as ‘sed’ and ‘awk’, which can simplify and speed up your data cleaning process. After reading this article, you will be able to harness these versatile tools for efficient and effective data processing.

Introduction

  • sed (Stream Editor): a stream editor that performs operations on a stream of text without opening it in any text editor.
  • awk (created by Aho, Weinberger & Kernighan): a data-driven programming language to process and analyze text.
  • grep (Global Regular Expression Print): a command-line tool to quickly search (with patterns and regular expressions) and filter text data in one or multiple files.
  • Combinations of the above tools:
    Each of these tools is powerful on its own, but combining them makes many complex operations easy.
    For example, on an extremely large dataset, sed can first search and replace some text, and awk can then extract meaningful fields and perform calculations, automating time-consuming tasks and making you more efficient (see the sketch below).
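To make this concrete, here is a minimal sketch of such a pipeline. The file name sales.csv and its layout (semicolon-separated, with a dollar-prefixed price in the second field) are made up for illustration:

$ sed 's/\$//g' sales.csv | awk -F';' '{ sum += $2 } END { print "total:", sum }'

Here sed strips the dollar signs so the prices parse as numbers, and awk sums the second field and prints the total.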

Installation

  • For Mac and Linux users: if you are using macOS or a Linux distribution, you already have these amazing tools installed on your system. Open up the terminal and try the following commands to invoke them (note that macOS ships the BSD variants, so --help may only print a usage line; man sed works everywhere):
$ sed --help
$ grep --help
$ awk --help

Motivation

Here are a few more use cases where a combination of the above tools comes in handy.

  • Data cleaning: Removing unwanted characters, formatting data, and fixing errors.
  • Data extraction: Selecting specific data based on certain conditions.
  • Data transformation: Converting data from one format to another.
  • Data aggregation: Calculating statistics and aggregating data.

Let’s Dig Deeper

1. grep : global regular expression print

This is the go-to tool for quick search operations over files while working in the terminal. With regular expressions, it can also search and filter data across multiple files. Let’s look at some examples.

Examples:

Find all occurrences of the pattern ‘patricia’ in a file:

$ grep 'patricia' myfile

Same as above but looking only for complete words:

$ grep -w 'patricia' myfile

Find all occurrences of the pattern ‘.Pp’ at the beginning of a line:

$ grep '^\.Pp' myfile

Find all lines in a file which do not contain the words ‘foo’ or ‘bar’:

$ grep -v -e 'foo' -e 'bar' myfile
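The examples above search a single file, but grep also shines across many files. A quick sketch, with the file and directory names (logs/, *.log, server.log) made up for illustration:

$ grep -rn 'ERROR' logs/     # search recursively, showing file names and line numbers
$ grep -l 'ERROR' *.log      # list only the names of matching files
$ grep -c 'ERROR' server.log # count matching lines in one file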

2. AWK

As mentioned before, ‘awk’ is a data-driven programming language mainly used as a text-processing, data-extraction, and reporting tool. It was created by Aho, Weinberger & Kernighan (hence the name AWK). This powerful tool lets us parse data from files and perform various operations on it, including searching for patterns and performing small calculations.

Examples:

Print lines longer than 72 characters:

$ awk 'length($0) > 72' myfile

The remaining examples are awk programs: run them inline as awk 'program' myfile, or save them to a file and run awk -f prog.awk myfile. Print the first two fields in opposite order:

{ print $2, $1 }

Same, with input fields separated by commas and/or spaces and tabs:

BEGIN { FS = ",[ \t]*|[ \t]+" }
      { print $2, $1 }

Add up the first column, then print the sum and average:

      { s += $1 }
END   { print "sum is", s, " average is", s/NR }

Print all lines between start/stop pairs:

/start/, /stop/

Simulate echo(1):

BEGIN {    # Simulate echo(1)
    for (i = 1; i < ARGC; i++) printf "%s ", ARGV[i]
    printf "\n"
    exit
}
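Awk’s associative arrays also make quick group-by aggregations possible. A small sketch, assuming a hypothetical comma-separated file orders.csv with a category in field 2 and an amount in field 3:

$ awk -F',' '{ total[$2] += $3 } END { for (c in total) print c, total[c] }' orders.csv

For each line, the amount is added to a running total keyed by category; at the end, one line per category is printed.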

3. sed : stream editor

As we mentioned before, ‘sed’ is a stream editor, i.e. a command-line tool used for editing a stream of text. It supports multiple operations, such as inserting or deleting lines and searching for or replacing specific words or regex patterns. It can even edit multiple text files at the same time. Let’s see some examples to truly understand it.

Examples:

Replace ‘bar’ with ‘baz’ when piped from another command:

$ echo "An alternate word, like bar, is sometimes used in examples." | sed 's/bar/baz/'

Escaping slashes with backslashes can make a command hard to read and follow:

$ echo "/home/example" | sed  's/\/home\/example/\/usr\/local\/example/'

Using a different separator can be handy when working with paths:

$ echo "/home/example" | sed 's#/home/example#/usr/local/example#'

Replace all occurrences of ‘foo’ with ‘bar’ in the file test.txt, without creating a backup of the file (this is the BSD/macOS form of -i; on GNU sed, sed -i 's/foo/bar/g' test.txt does the same):

$ sed -i '' -e 's/foo/bar/g' test.txt
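Besides substitution, sed can also address lines by number or by pattern, which covers the insert/delete use cases mentioned above. A quick sketch, with notes.txt as a made-up file name:

$ sed '1d' notes.txt         # delete the first line (e.g. a header)
$ sed -n '5,10p' notes.txt   # print only lines 5 through 10
$ sed '/^#/d' notes.txt      # drop comment lines starting with '#'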

4. The “|” (pipe) symbol

The last tool we will cover today is “|”, the pipe symbol. A lot of people don’t know what it’s for, but on Linux and other Unix-like systems this symbol links two commands: it directs the output of one command and serves it as the input to another.

For example, in ls | grep, the list of contents in the directory (the output of ‘ls’) is sent as input to the ‘grep’ command, which filters it.
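As a tiny sketch, the following pipeline counts how many .csv files are in the current directory:

$ ls | grep '\.csv$' | wc -l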

Now let us get our hands dirty

Imagine you are a data scientist working at a real-estate company. You download a property tax report from the City of Vancouver’s Open Data Portal. The dataset contains information on properties from BC Assessment (BCA) and City sources in 2021.

You might expect a newly built house to fetch a higher price than one built decades ago. So let us clean the data to explore this idea.

1. First glance of the data

$ head -2 property-tax-report-2.csv

Here head -2 prints the first two lines of the data file, letting us see what the header and a data row look like.
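If the semicolon-separated fields are hard to read, one optional trick (assuming the standard column utility is available) is to align them into a table:

$ head -2 property-tax-report-2.csv | column -s';' -t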

Since housing prices vary a lot by location, we will only consider houses whose postal code starts with ‘V6A’. Furthermore, we remove the houses that were built before 1900.

2. Locate the columns you want to use for the filter

$ head -2 property-tax-report-2.csv | awk -F';' '{print $(NF-4), $14}'

Here print $(NF-4), $14 prints only the YEAR_BUILT column (the fifth field from the end) and the PROPERTY_POSTAL_CODE column (field 14).
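If you are wondering how to find those field positions in the first place, one handy sketch is to split the header onto numbered lines and grep for the names you care about:

$ head -1 property-tax-report-2.csv | tr ';' '\n' | cat -n | grep -iE 'postal|year'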

3. Check the top 10 of the buildings number by year

$ cat property-tax-report-2.csv | awk -F';' '{print $(NF-4)}' | sort | uniq -c | sort -nr | head -10

Here the sort | uniq -c | sort -nr part first sorts the build years, then uniq -c counts how many buildings share each year, and the second sort -nr orders those counts from highest to lowest, so head -10 shows the top 10.

4. Make the filter

$ head -10000 property-tax-report-2.csv | awk -F';' '{if (($14 ~ /^V6A/) && ($(NF-4) > 1900)) print $14, $(NF-4)}' | head -5

Here $14 ~ /^V6A/ means field 14 starts with V6A. You could also replace the awk condition with grep:

$ head -10000 property-tax-report-2.csv | grep 'V6A' | awk -F';' '{print $14, $(NF-4)}' | head -5

It seems to work! (Note that the grep version is a bit looser: it matches ‘V6A’ anywhere on the line, not only at the start of the postal code.)

5. Output the result to a file

$ cat property-tax-report-2.csv | awk -F';' '{if (($14 ~ /^V6A/) && ($(NF-4) > 1900)) print $0}' > property-tax-report-filter.csv

Just remember that the header line is dropped by this filter (its postal-code field does not start with V6A). To turn the output into a “real” comma-separated values file, just add sed "s/;/,/g" to the pipeline:

$ cat property-tax-report-2.csv | awk -F';' '{if (($14 ~ /^V6A/) && ($(NF-4) > 1900)) print $0}' | sed "s/;/,/g" > property-tax-report-filter.csv
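If you would rather keep the header, a small tweak (just a sketch) lets line 1 through unconditionally, since a bare awk pattern prints every line that matches it:

$ cat property-tax-report-2.csv | awk -F';' 'NR==1 || ($14 ~ /^V6A/ && $(NF-4) > 1900)' | sed "s/;/,/g" > property-tax-report-filter.csv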

6. Last move

We compute a new column, (CURRENT_LAND_VALUE + CURRENT_IMPROVEMENT_VALUE)/1000000, so we can study whether YEAR_BUILT and house price are correlated.

$ head -2 property-tax-report-filter.csv | awk -F',' '{print $(NF-4), ($20+$21)/1000000}'

(Note that the field separator is now a comma, since we converted the file in step 5.)
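To actually check the correlation without leaving the terminal, here is a rough sketch of Pearson’s r in awk, assuming the comma-separated filtered file from step 5 and the same column positions:

$ awk -F',' '{ n++; x = $(NF-4); y = ($20 + $21)/1000000;
      sx += x; sy += y; sxx += x*x; syy += y*y; sxy += x*y }
  END { r = (n*sxy - sx*sy) / sqrt((n*sxx - sx*sx) * (n*syy - sy*sy));
        print "Pearson r =", r }' property-tax-report-filter.csv

A value near 0 would suggest no linear relationship. For a serious analysis you would still move to R or Python, but this gives a quick first look.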

Advantages

Nowadays, there are many tools that can help us with data preparation, such as Python, Perl, and R, and some even come with a graphical user interface, like Microsoft Excel, Jupyter Notebook, and Tableau Prep. You may have already mastered some of these modern technologies, so you might ask:

“Why should I learn some old-fashioned command line languages, given that learning something new can be such a painful process?”

Well, it’s true that sed and awk are old — they’re already in their 40s — but that doesn’t mean they’re outdated. In some ways they still perform better than other tools:

1. They are preinstalled in Unix and Unix-Like Systems

Have you ever been in a situation where your lab has just bought a new workstation, your company’s IT support team has just reinstalled the system, or you are simply using someone else’s device, and you need to process a batch of data before the environment and software have been configured? It can be a dilemma when your task is urgent and you don’t want to waste time configuring, or waiting for an IT technician to configure, the environment. But don’t worry: all the command-line tools mentioned above come preinstalled on Unix-based systems, and you don’t need to install any extra packages like you do for Python.

2. They Are One-Liner-Friendly

The syntax of sed and awk is designed to be concise, allowing you to accomplish a lot with relatively little code, possibly even a single line once you are familiar with their features. For the same task, sed/awk code is usually shorter than the equivalent in a high-level programming language. Here is an example of using awk to extract the first column of an input CSV file and write it to an output file:

$ awk -F, '{print $1}' input > output

And its equivalent Python implementation:

with open("input_path", "r") as f_input, open("output_path", "w") as f_output:
    for line in f_input:
        f_output.write(line.split(",")[0] + "\n")

The awk version takes less effort to write and is more human-readable than the Python implementation. And this is the simplest of cases; for more complex tasks, the difference can be even larger.

3. They Run Fast

Sed and awk are not as extensible as modern languages like Python and Perl, so they serve a narrower range of purposes. However, it is precisely because of this focus that they can skip features irrelevant to text processing, stay lightweight, and ship a well-optimized interpreter. For tasks within their capability, they are normally faster than high-level programming languages, not to mention GUI software.

Another reason they’re fast is that sed uses a simple regular-expression model, while awk compiles regular expressions into a state machine. The former makes sed extremely fast at simple text-processing tasks, while the latter lets awk quickly determine whether a line matches a complex pattern, even over large amounts of text.

4. They Are Mouse-Free

This one depends on personal habits. Some programmers like GUIs and mice because they make life easier, while others prefer the keyboard alone because the mouse sometimes interrupts their flow of thought, and shortcuts are usually faster than mouse clicks.

If you are a keyboard lover, then sed and awk are perfect for you: they are purely text- and keyboard-based, with no need to move or click a mouse.

Final Note

We have barely scratched the surface of the real power of command-line tools. There are many more that make our day-to-day work easier, such as cut, sort, uniq, wc, and tr. But those might be the topic of a future post.

Till then, keep learning 😊

References

Bash Guide for Beginners

Awk: The Power and Promise of a 40-Year-Old Language

Man pages: sed

Pipeline (Unix), Wikipedia

City of Vancouver — Open Data Portal

What are the differences between Perl, Python, AWK and sed

When To Use Sed Awk Over Perl Or Python

Regular Expression Matching Can Be Simple And Fast, Russ Cox
