Processing Big Datasets in Python using Unix Commands

Neetha Sherra
Analytics Vidhya
4 min read · Dec 20, 2019


As a data scientist, in the course of your analysis, you’re bound to come across datasets that are larger than the free memory on your system. In that scenario, loading the whole dataset at once becomes impossible. For example, my system usually has 4GB of free memory, while some of the datasets I have worked with exceed 4GB. For this post, I decided to use as an example a small CSV dataset I have already worked with in one of my previous posts: the SF Crime dataset.

When you find yourself dealing with such a situation, two broad options can help you out.

First, you can read just a portion of your file (say, the first few rows) using the read_csv function in pandas, and get a feel for the structure of the dataset before working on it. The main disadvantage of this method is that the data you need for your analysis may not be present in the sample you’ve just loaded. For example, since I only want rows that correspond to a particular District and Category of Crime from my dataset, loading a portion of the entire dataset isn’t going to help me get anywhere with that analysis. We can of course work around that problem, but that’s a story for another blogpost.
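A minimal sketch of this sampling approach (assuming the dataset is saved as sf_crime.csv; substitute your own filename):

import pandas as pd

# load only the first 1,000 rows to get a feel for the structure
sample = pd.read_csv('sf_crime.csv', nrows=1000)
print(sample.dtypes)
sample.head()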

The second method (the one I actually want to talk about) is to use Unix commands to explore the entire file as it is. If you’re working in Jupyter Notebooks, as I am in this example, you can run Unix commands by prefixing them with a ‘!’ symbol. Since a CSV file is already structured, the task of exploring it this way is made easier.
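For example (again using sf_crime.csv as a stand-in for your own file), a ‘!’ command runs in the shell, and assigning it to a name captures its output as a Python list of strings:

# check how big the file is before deciding how to load it
!ls -lh sf_crime.csv
# assigning the command to a name captures its output in Python
file_info = !ls -lh sf_crime.csv
print(file_info)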

Before I get into the commands, I just want to say that Unix offers you many different options for a particular action. For example, you can extract lines with a particular pattern using commands ‘grep’, ‘sed’ and even ‘awk’. The commands used below are the ones I picked and found suitable among many available options. Another person may feel like a different combination of commands is more suited to their purpose.

So, let’s get into some Unix commands that I used to get information on my dataset.

wc -l <filename>
  • Here wc (word count) coupled with the ‘-l’ option gives you the number of lines in the file
  • This includes the header of course!
Output of ‘wc’
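If you want that count available as a Python number (minus the header), you can capture the command’s output; a rough sketch, using the same stand-in filename:

out = !wc -l sf_crime.csv
n_rows = int(out[0].split()[0]) - 1   # first field is the line count; subtract the header
print(n_rows)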
head -n <number> <filename>
  • Similar to the head(n) function in pandas, this gives the first ’n’ lines of the file
  • This output will give you a fair idea of how data is structured in the file; the headers will give you column names, and the first few lines will give you the type of data contained in each column
Output of ‘head’
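A small extra trick that pairs nicely with ‘head’ (not part of the command list above): piping just the header line through ‘tr’ prints one column name per line, which is easier to scan when there are many columns:

!head -1 sf_crime.csv | tr ',' '\n'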
cut -d',' -f6 <filename> | sort | uniq -c | sort -nr
  • Similar to the value_counts() function in pandas, I used this command to get the unique values of the column I was interested in, along with their counts
  • This is where the pipe operator comes into the picture. Piping is a form of redirection: it takes the output of one command and uses it as the input of the next one
  • The ‘-d’ option is used to specify the delimiter, in this case a ‘,’
  • The ‘-f’ option is used to specify the field number (in this case, a column)
  • The output of cut is then sorted, and finally the unique values in column 6 are displayed (note that column numbering starts at 1, not 0)
  • The ‘-c’ option gives a count of each unique value
  • You may wonder why sort is used before ‘uniq’; this is because ‘uniq’ only removes repeated lines when they are adjacent, and ‘sort’ is what brings identical values next to each other
  • I’ve used sort again at the end to display the counts in decreasing order; ‘-n’ does a numeric sort (ascending) and ‘-r’ reverses the result
Output of ‘cut, sort, uniq’
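If you’d rather work with those counts in Python instead of just reading them off the screen, you can capture the pipeline’s output and parse it; a sketch, with the same stand-in filename:

counts = !cut -d',' -f6 sf_crime.csv | sort | uniq -c | sort -nr
value_counts = []
for line in counts:
    parts = line.strip().split(None, 1)   # each line looks like "  174900 LARCENY/THEFT"
    if len(parts) == 2:
        value_counts.append((parts[1], int(parts[0])))
print(value_counts[:5])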
head -1 $f > output.csv
awk -F, '{ if (($2 == "LARCENY/THEFT") && ($7 == "SOUTHERN")) { print } }' $f >> output.csv
  • Here, the first command prints the header line to an output file and the second command uses a conditional statement to select lines and print them to the same output file
  • ‘>’ operator creates a new file if it doesn’t exist or overwrites the existing file
  • ‘>>’ operator creates a new file if it doesn’t exist or appends to the existing file
Output of ‘awk’
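With the filtered rows written to output.csv, the result should now be small enough to load with pandas in the usual way:

import pandas as pd

filtered = pd.read_csv('output.csv')
filtered.shape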
