Exploring the US University Ranks 2017 at the Command-line (Part III — cut, sort and uniq)

Image adapted and modified from unsplash.com

Hello coders! Welcome back to my lessons from the series Data Science @ Bash Shell. In this lesson, we want to calculate how many institutes have been ranked from each of the US states in the data set. You can also get the whole lesson as a PDF file from the following link:

Download the data with the command below (we have slightly simplified the original) and save it as: unirank.csv

wget https://www.scientificprogramming.io/datasets/unirank.csv
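Before slicing the file up, it is worth peeking at its first few lines to confirm the column layout (the column numbers used below assume this layout):

$ head -n 5 unirank.csv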

Finding the number of institutes from each state

Let’s start by extracting only the part of each line that is relevant to us. In our case, we are interested in columns 1 and 3 (the university and state names, respectively). To extract these columns, we can make use of a command called cut as follows:

$ cat unirank.csv | cut -f1,3 -d,
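The output should contain lines shaped roughly like these (illustrative rows, not the file’s exact contents):

Harvard University,MA
Stanford University,CA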

Here, the command-line option -f specifies which fields (columns) to cut out (extract) from the file, and the option -d, tells cut to use the comma (,) as the delimiter. When you run that command, you should see that the output consists only of university names and states. Note that, despite its name, the cut command does not modify the original file it acts on.

Now onto the last part: counting how many universities came from each state. There isn’t a single command that can do all of that, so we will have to combine two. We need the command uniq -c to count (hence the -c) how many times each state appears. However, uniq only condenses repeated lines when they are adjacent, so its input must be sorted first. We can do this very easily with a command that is conveniently called sort:

$ cat unirank.csv | cut -f1,3 -d, | sort -k2 -t","

The sort options: -k2 tells sort to use column 2 as the key, and -t"," tells it that the delimiter is a comma (,).

Output: Institutes sorted by state

Notice that, as a result of our list being sorted, all the lines with the same state are right next to each other. Now, as mentioned in our plan above, we’ll use uniq -c to “condense” neighboring lines that are identical and, in the process, count how many of each are seen:

$ cat unirank.csv | cut -f3 -d, | sort | uniq -c
Output: counts of institutes per state

We now have a listing of how many universities came from each state, and it’s clear that the vast majority of ranked institutes come from the state of CA (California)!
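To make that easier to see, you can sort the counts themselves: uniq -c prints the count as the first field, so a reverse numeric sort puts the biggest states on top. A minimal extension of the pipeline above:

$ cat unirank.csv | cut -f3 -d, | sort | uniq -c | sort -rn | head -5

Here sort -rn sorts numerically (-n) in reverse (-r) order, and head -5 keeps only the top five states.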

Finding a correlation between ranks and tuition fees?

We already know the rank and the tuition fees of each university (given in the data). An interesting question to investigate is: how do university ranks correlate with tuition and fees?

To carry this out, we first cat and cut out the tuition and fees (column 4) and the ranks (column 6) from the data into a new data set called udata.csv:

cat unirank.csv | cut -f4,6 -d, > udata.csv

Note that the redirection symbol (>) saves the output into the new file. This data can now simply be plotted using a terminal scatterplot tool called scatter (install it with: sudo pip install bashplotlib).

cat udata.csv | tail -n +2 | scatter

Note that tail -n +2 excludes the first row (i.e., the column titles) before passing the rest of the output on to scatter. However, this tool’s output doesn’t make much sense on its own, as it doesn’t show any x- and y-axis labels. Therefore, we uploaded the data (udata.csv) to an online tool called plot.ly, which produced the following beautiful scatter plot:

Output: Ranks vs. tuition: a scatterplot via plot.ly
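If you would rather have a number than a picture, you can also approximate Pearson’s correlation coefficient directly with awk. This is a minimal sketch, assuming both columns of udata.csv are plain numbers (no $ signs or embedded commas):

$ tail -n +2 udata.csv | awk -F, '{ n++; sx+=$1; sy+=$2; sxx+=$1*$1; syy+=$2*$2; sxy+=$1*$2 } END { print (n*sxy - sx*sy) / sqrt((n*sxx - sx*sx) * (n*syy - sy*sy)) }'

A value close to +1 or -1 indicates a strong linear relationship between tuition and rank; a value near 0 indicates a weak one.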

It’s a no-brainer to see from the plot above that highly ranked universities charge higher tuition fees! However, the scatterplot also depicts one university (Brigham Young University — Provo) with a relatively high rank (68) yet extremely low tuition fees ($5,300 USD per year). Is this an anomaly (outlier) in the data set? We leave that question for you to investigate further!
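If you want to inspect that row yourself, a simple grep on the original file will pull it up (assuming the university’s name appears verbatim in the data):

$ grep -i "brigham young" unirank.csv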

Summary

In this project we learned to use some important bash commands, such as head, tail, sort, uniq, and cut, in the context of mining a CSV-formatted toy data set consisting of rankings of US academic institutes.

See the previous parts: Part I, Part II 🚶

[This project is a part of the ‘Learn to Analyze Data in Bash Shell and Linux’ course.]
