Data Scientist’s CLI toolbox

Alexander Vingardt · Published in Analytics Vidhya · Jan 11, 2020

The command line is an essential part of any data person’s diet.

It has been the unsung hero behind most data products I’ve built, as well as countless fancy charts and Excel files that businesses so rely on.

This guide is meant as an introduction to the building blocks that data pipelines are built on top of. I’m keeping everything intentionally simple so we develop an intuition for which tool does what best. In later articles we will use this knowledge to create pipelines of increasing complexity.

Let’s dive right in. Here’s what we have in store for you —

  • dealing with data sources
  • processing languages awk, sed and jq
  • bash loops
  • utilities such as wc, grep, sort, uniq and others

Data source patterns —

First things first. We need to get our data from somewhere. In my experience it’s usually been a plain text or a JSON file. We can also generate data on the fly or have something hardcoded for testing. Even if my source is a database, provided the dataset is not particularly intimidating in size, I would dump it into a csv to play with. If it is, I would use a sample to fine-tune the pipeline.
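
For the curious, here is roughly what that dump-and-sample step might look like. This is only a sketch: the psql connection string and the events table are hypothetical, and shuf ships with GNU coreutils, so adjust for your own database and platform.

# dump a (hypothetical) Postgres table to csv
psql "$DATABASE_URL" -c "\copy (SELECT * FROM events) TO 'data.csv' CSV HEADER"

# keep the header, then take a random 1,000-row sample to fine-tune the pipeline on
head -n 1 data.csv > sample.csv
tail -n +2 data.csv | shuf -n 1000 >> sample.csv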

  1. Plain text. Whitespace, comma or tab delimited.

cat data.txt

2. JSON file. To feed this data forward we need to flatten it, so jq is perfect for the task. I will cover jq in greater detail later in the post. For the sake of this example, suppose we have a list of { username, birth_date } and we intend to turn it into a tab delimited text format.

cat data.json | jq -r '.[] | [.username, .birth_date] | @tsv'

3. Generated input. Usually a list of numbers. To generate 1 through 10 inclusive —

seq 10
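
seq also accepts a start and a step, in case the range needs to be something other than 1 through N. For instance, even numbers from 0 to 10:

seq 0 2 10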

4. Hardcoded string. Very helpful for testing pipelines, particularly the ones that involve regular expressions.

echo "Lorem ipsum dolor sit amet"

Awk patterns

  1. Getting a column of data.

Suppose you have a tab delimited file with 3 columns of data and you need the second column.

cat data.txt | awk -F'\t' '{print $2}'

$0 refers to the whole line, whereas $N refers to the Nth field (1-based).
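
To make that concrete, here is a small variation on the same (hypothetical) tab delimited file that swaps the first two columns:

cat data.txt | awk -F'\t' '{print $2 "\t" $1}'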

2. Making sure your file has the same number of columns on each line.

cat data.txt | awk -F'\t' '{print NF}' | sort -u

NF stands for ‘number of fields’.

If this returns a single number and it’s the number you expect, all is well. Sometimes it’s helpful to do this check if you don’t trust the source or your file was stitched from multiple sources. Better be safe.
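
If the check comes back with more than one number, a pattern like this points at the offending lines. The expected column count of 3 is just an assumption for the example:

cat data.txt | awk -F'\t' 'NF != 3 {print NR": "$0}'

NR is the current line number, so you get both the position and the contents of every line that does not have exactly 3 fields.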

3. You need to get the last column. Yes, sometimes that comes in handy. Here’s a twist on NF: prefix it with $ just like a field number, and $NF gives you the last field.

cat data.txt | awk -F'\t' '{print $NF}'

4. Math expression as a filter. Suppose we need all numbers divisible by 3.

cat integers.txt | awk '$0 % 3 == 0'

awk prints matching lines by default, so if all your action block would do is print the whole line, it can be omitted.
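
In other words, the filter above is just shorthand for the explicit pattern-plus-action form:

cat integers.txt | awk '$0 % 3 == 0 {print $0}'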

Sed patterns

  1. Plucking a single line from a file. Suppose we need line number 10 (1-based count).

cat data.txt | sed -n '10p'

sed, just like awk, prints input lines by default. Here we pass -n to suppress that behavior and use the explicit ‘p’ directive to print only when needed.

A good practice is to exit once sed has satisfied the condition and is not expected to produce any more output. With small files it’s not an issue, but if we are dealing with logs or files of that magnitude, we can do the following.

cat data.txt | sed -n '10{p;q;}'

This quits right after it prints the 10th line.

2. Getting a range of lines. For instance, lines 10 through 20 inclusive.

cat data.txt | sed -n '10,20p'

3. To print from a particular line until the end of file. The last line is represented by the special character $.

cat data.txt | sed -n '10,$p'

4. Perform a string substitution. Suppose we want to replace all occurrences of ‘aaa’ with ‘bbb’.

cat data.txt | sed 's/aaa/bbb/g'
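
If you want to modify the file in place rather than stream the result, sed also has -i. Note that the flag behaves slightly differently between GNU sed and the BSD sed that ships with macOS:

# GNU sed (most Linux distributions)
sed -i 's/aaa/bbb/g' data.txt
# BSD sed (macOS) requires an explicit backup suffix, which may be empty
sed -i '' 's/aaa/bbb/g' data.txt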

5. Print lines between two regular expression matches. Here we are printing everything between a line that matches ‘aaa’ and a line that matches ‘bbb’. Start and end lines are included.

cat data.txt | sed -n '/aaa/,/bbb/p'

6. We can mix and match line matching conditions. For instance, we can print everything from a line that matches ‘aaa’ until the end of file.

cat data.txt | sed -n '/aaa/,$p'

Jq patterns

  1. Pretty print a JSON file. The lone . is jq’s identity filter, which passes the parsed JSON straight through.

cat data.json | jq .

2. Extract a field from a list of objects. Suppose you have a list of people with their first name, last name and age. We want age.

cat data.json | jq -r '.[] | .age'

.[] unwraps an array, turning it into a stream of records, and .age accesses that particular field for each record. As you may have noticed, jq has pipes of its own as well.

3. Flatten data into a tsv (same as a csv but tab delimited). I prefer using tabs since it’s less likely I would run into parsing issues later down the (pipe)line.

cat data.json | jq -r '.[] | [.first, .last, .age] | @tsv'

Here we are creating a new array out of each record with just the fields we want and then piping it into the special @tsv formatter. Note that we are passing -r to jq. It makes sure string output gets printed ‘raw’, meaning without quotes around it. I don’t recall the last time I used jq without it.
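
If the next step expects commas instead, the same pipeline works with the @csv formatter, which double-quotes string fields for you:

cat data.json | jq -r '.[] | [.first, .last, .age] | @csv'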

4. Filter. Let’s say we want everyone over the age of 40.

cat data.json | jq -r '.[] | select(.age > 40)'

We can also have multiple conditions.

cat data.json | jq -r '.[] | select(.age > 40 and .first == "Bob")'

5. Array length.

cat data.json | jq 'length'

Make sure NOT to unwrap an incoming array before feeding it into ‘length’.
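
The filter and length patterns combine nicely: collect the filtered stream back into an array before asking for its length. For instance, to count everyone over 40:

cat data.json | jq '[.[] | select(.age > 40)] | length'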

I highly recommend going over jq’s manual and cookbook. They have a ton of goodies.

6. Dump all possible paths.

cat data.json | jq -r 'paths(scalars) | join(".")'
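
One caveat: if the data contains arrays, those paths include numeric indices, and some jq builds are picky about joining non-string values. Converting everything to strings first keeps it portable:

cat data.json | jq -r 'paths(scalars) | map(tostring) | join(".")'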

Bash

Ahhh. Good old bash.

  1. while loops. I use them quite often if I need to do more than one thing in a single step. In this example we grab some data with curl. The remote service we are hitting returns json, but all in a single line. I want it pretty printed.

cat -n long_lat.txt | while read idx long lat; do curl -X GET "https://api.fancyservice.com/search?latitude=$lat&longitude=$long" | jq . > "$idx.json"; sleep 2; done

A few things are going on here.

  • I call ‘cat’ with -n. This numbers all lines starting at 1. Now I can loop through data with an index.
  • I read 3 columns of data — index, longitude and latitude — into their respective variables.
  • I pretty print curl’s output and dump it into a file.
  • Sleep for 2 seconds so I don’t raise any red flags (unless that api was specifically designed to get slammed)
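
Stripped of the curl call, the core trick is just read splitting each line into as many variables as you name. A minimal sketch, assuming a three-column whitespace or tab delimited file with first name, last name and age:

cat data.txt | while read first last age; do echo "$last, $first ($age)"; done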

Nifty utilities —

  1. Line count

cat data.txt | wc -l

2. String filter. Whereas awk is perfect for filters that deal with numbers, grep and sed have lots of love for strings.

Give me all lines that start with ‘mary’ ignoring case.

cat data.txt | grep -i '^mary'

We can flip it and ask for all lines that do not start with ‘mary’.

cat data.txt | grep -vi '^mary'

Or match ‘mary’ as a whole word.

cat data.txt | grep -wi 'mary'
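
And if matching should be stricter still, -x requires the whole line to equal ‘mary’:

cat data.txt | grep -ix 'mary'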

3. Sorting. Suppose I want to sort a file with people’s names.

cat names.txt | sort

If I want it in reverse order —

cat names.txt | sort -r

If we are dealing with numbers, then use -n

cat integers.txt | sort -n

or, for descending order,

cat integers.txt | sort -nr

4. Uniqueness. If there’s a sort somewhere, uniq is lurking right around the corner. Let’s say we want a distinct list of people’s names.

cat names.txt | sort -u

If, however, we want more than that and want to count the number of occurrences of each name —

cat names.txt | sort | uniq -c

The counts returned by uniq are not ordered by count. We can easily remedy this.

cat names.txt | sort | uniq -c | sort -nr

This would give us counts and their respective names in descending order. To limit it to a top 5, we can do —

cat names.txt | sort | uniq -c | sort -nr | head -5

Conclusion

Hope you enjoyed this primer.

The tools presented here have a lot more to offer than covered in this article. Hope you catch the command line bug and continue exploring on your own.

Please keep in mind that sometimes the magnificent power that a particular command has to offer might come at the expense of readability.

See you next time!
