Processing Big Datasets in Python using Unix Commands

Neetha Sherra
Analytics Vidhya
4 min read · Dec 20, 2019


As a data scientist, in the course of your analysis, you’re bound to come across datasets that are larger than the free memory on your system. In that scenario, loading the whole dataset at once becomes impossible. For example, my system usually has 4GB of free memory, while some of the datasets I have worked with exceed 4GB. For this post, I decided to use as an example a small CSV dataset I have already worked with in one of my previous posts: the SF Crime dataset.

When you find yourself dealing with such a situation, two broad options can help you out.

First, you can read just a portion of your file (say, the first few rows) using the read_csv function in pandas, and get a feel for the structure of the dataset before working on it. The main disadvantage of this method is that the data you need for your analysis may not be present in the sample you’ve just loaded. For example, since I only want rows that correspond to a particular District and Category of Crime from my dataset, loading a portion of the entire dataset isn’t going to help me get anywhere with that analysis. We can of course work around that problem, but that’s a story for another blogpost.
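A minimal sketch of this sampling approach (assuming the dataset is saved as sf_crime.csv; substitute your own filename):

import pandas as pd

# load only the first 1,000 rows to get a feel for the structure
sample = pd.read_csv('sf_crime.csv', nrows=1000)
print(sample.dtypes)
sample.head()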

The second method (the one I actually want to talk about) is to use Unix commands to explore the entire file as it is. If you’re working in Jupyter Notebooks, as I am in this example, you can run Unix commands by prefixing them with a ‘!’ symbol. Since a CSV file is already structured, the task of exploring it this way is made easier.
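For example (again using sf_crime.csv as a stand-in for your own file), a ‘!’ command runs in the shell, and assigning it to a name captures its output as a Python list of strings:

# check how big the file is before deciding how to load it
!ls -lh sf_crime.csv
# assigning the command to a name captures its output in Python
file_info = !ls -lh sf_crime.csv
print(file_info)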

Before I get into the commands, I just want to say that Unix offers you many different options for a particular action. For example, you can extract lines with a particular pattern using commands ‘grep’, ‘sed’ and even ‘awk’. The commands used below are the ones I picked and found suitable among many available options. Another person may feel like a different combination of commands is more suited to their purpose.

So, let’s get into some Unix commands that I used to get information on my dataset.

wc -l <filename>
  • Here wc (word count) coupled with the ‘-l’ option gives you the number of lines in the file
  • This includes the header of course!
Output of ‘wc’
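If you want that count available as a Python number (minus the header), you can capture the command’s output; a rough sketch, using the same stand-in filename:

out = !wc -l sf_crime.csv
n_rows = int(out[0].split()[0]) - 1   # first field is the line count; subtract the header
print(n_rows)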
head -n <number> <filename>
  • Similar to the head(n) function in pandas, this gives the first ’n’ lines of the file
  • This output will give you a fair idea of how data is structured in the file; the headers will give you column names, and the first few lines will give you the type of data contained in each column
Output of ‘head’
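A small extra trick that pairs nicely with ‘head’ (not part of the command list above): piping just the header line through ‘tr’ prints one column name per line, which is easier to scan when there are many columns:

!head -1 sf_crime.csv | tr ',' '\n'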
cut -d',' -f6 <filename> | sort | uniq -c | sort -nr
  • Similar to the value_counts() function in pandas, I used this command to get the unique values of the column I was interested in, along with their counts
  • This is where the pipe operator comes into the picture. Piping is a form of redirection: it takes the output of one command and uses it as the input of the next one
  • The ‘-d’ option is used to specify the delimiter, in this case a ‘,’
  • The ‘-f’ option is used to specify the field number (in this case, a column)
  • The output of cut is then sorted, and finally the unique values in column 6 are displayed (note that column numbering starts at 1, not 0)
  • The ‘-c’ option gives a count of each unique value
  • You may wonder why sort is used before ‘uniq’; this is because ‘uniq’ only removes repeated lines when they are adjacent, and ‘sort’ is what brings identical values next to each other
  • I’ve used sort again at the end to display the counts in decreasing order; ‘-n’ does a numeric sort (ascending) and ‘-r’ reverses the result
Output of ‘cut, sort, uniq’
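If you’d rather work with those counts in Python instead of just reading them off the screen, you can capture the pipeline’s output and parse it; a sketch, with the same stand-in filename:

counts = !cut -d',' -f6 sf_crime.csv | sort | uniq -c | sort -nr
value_counts = []
for line in counts:
    parts = line.strip().split(None, 1)   # each line looks like "  174900 LARCENY/THEFT"
    if len(parts) == 2:
        value_counts.append((parts[1], int(parts[0])))
print(value_counts[:5])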
head -1 $f > output.csv
awk -F, '{ if (($2 == "LARCENY/THEFT") && ($7 == "SOUTHERN")) { print } }' $f >> output.csv
  • Here, the first command prints the header line to an output file and the second command uses a conditional statement to select lines and print them to the same output file
  • ‘>’ operator creates a new file if it doesn’t exist or overwrites the existing file
  • ‘>>’ operator creates a new file if it doesn’t exist or appends to the existing file
Output of ‘awk’
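With the filtered rows written to output.csv, the result should now be small enough to load with pandas in the usual way:

import pandas as pd

filtered = pd.read_csv('output.csv')
filtered.shape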
