Exploring the US University Ranks 2017 at the Command-line (Part III — cut, sort and uniq)
Hello coders! Welcome back to my lessons from the series Data Science @ Bash Shell. In this lesson, we want to calculate how many institutes have been ranked from each of the US states in the data set. You can also get the whole lesson as a PDF file from the following link:
We will use a simple, publicly available data set obtained from data.world, called US News Universities Rankings.
You should download the data from the web page below (we have slightly simplified the data) and save it as unirank.csv.
Finding the number of Institutes from each state
Let’s start by extracting only the part of each line that is relevant to us. In our case, we are interested in columns 1 and 3 (which hold the university and state names, respectively). To extract these columns, we can make use of a command called cut, as follows:
$ cat unirank.csv | cut -f1,3 -d,
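To see what cut does in isolation, here is a minimal sketch on a tiny, made-up sample file. The rows and column layout below are hypothetical; they are only shaped like unirank.csv:

```shell
# Two hypothetical rows in the same shape as unirank.csv
# (name, city, state, tuition, enrollment, rank):
printf '%s\n' \
  'Harvard University,Cambridge,MA,45278,6700,2' \
  'Stanford University,Stanford,CA,45729,7000,4' > sample.csv

# Keep only fields 1 and 3 (university name and state), splitting on commas:
cut -f1,3 -d, sample.csv
# Harvard University,MA
# Stanford University,CA
```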
Here, the command-line option -f specifies which fields (columns) to cut out (extract) from the file, and the option -d, tells cut to delimit the fields by commas (,). When you run that command, you should see that the output consists only of university names and states. Note that, despite its name, the cut command does not modify the original file it acts on.
Now onto the last part. We would like to count how many universities came from each state. However, there isn’t a single command that can do all of that, so we will have to combine two commands. We need uniq -c to count (hence the -c) how many times each state appears. However, uniq only condenses adjacent duplicate lines, so it requires its input to be sorted; the first step is therefore to sort the list of universities and states. We can do this very easily with a command that is conveniently called sort:
$ cat unirank.csv | cut -f1,3 -d, | sort -k2 -t","
Here, the -k2 option tells sort to use column 2 (the state) as the sort key, and the -t"," option tells it that the fields are delimited by a comma (,).
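As a quick check of these two options, here is a sketch on three hypothetical name,state pairs (the data is invented; only the flags match the pipeline above):

```shell
# Hypothetical name,state pairs, as the cut step would produce:
printf '%s\n' 'Yale,CT' 'Stanford,CA' 'MIT,MA' > pairs.csv

# -t"," splits each line on commas; -k2 sorts on the second field (the state):
sort -t"," -k2 pairs.csv
# Stanford,CA
# Yale,CT
# MIT,MA
```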
Notice that, as a result of our list being sorted, all the lines with the same state are right next to each other. Now, as mentioned in our plan above, we use uniq -c to “condense” neighboring lines that are identical and, in the process, count how many of each are seen:
$ cat unirank.csv | cut -f3 -d, | sort | uniq -c
We now have a listing of how many universities came from each state, and it is clear that the vast majority of ranked institutes came from the state of CA, California!
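The whole counting pipeline can be sketched end-to-end on a hypothetical five-row file; an extra sort -rn at the end (not strictly needed above) orders the states by their counts, largest first, which makes the top state easy to spot:

```shell
# Five hypothetical rows with the state in field 3:
printf '%s\n' \
  'A,Boston,MA' 'B,Stanford,CA' 'C,Berkeley,CA' 'D,Los Angeles,CA' 'E,New Haven,CT' > mini.csv

# Extract the state column, group identical states together,
# count each group, then sort the counts in descending order:
cut -f3 -d, mini.csv | sort | uniq -c | sort -rn
# The most frequent state (here CA, with 3 rows) ends up on top.
```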
Finding a correlation between ranks and tuition fees?
We already know each university’s rank and tuition fees (both given in the data). An interesting question to investigate is whether university ranks correlate with tuition and fees.
To carry this out, we first cut out the tuition and fees (column 4) and the ranks (column 6) and save them into a new data set called udata.csv:
$ cat unirank.csv | cut -f4,6 -d, > udata.csv
Note that the redirection symbol (>) saves the output to a file instead of printing it to the screen. This data can now be plotted using a scatter-plot tool called scatter, which ships with the bashplotlib Python package (install it with sudo pip install bashplotlib):
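A minimal sketch of this cut-and-redirect step, using a made-up stand-in for unirank.csv (only the column positions match the real file):

```shell
# Hypothetical rows with tuition in field 4 and rank in field 6:
printf '%s\n' 'A,x,MA,40000,5000,1' 'B,y,CA,42000,6000,2' > sample.csv

# Cut out the two columns and redirect them into a new file:
cut -f4,6 -d, sample.csv > udata.csv
cat udata.csv
# 40000,1
# 42000,2
```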
$ cat udata.csv | tail -n +2 | scatter
Here, tail -n +2 excludes the first row (the column titles) before passing the rest of the output down the pipeline to scatter. However, this tool’s output does not make much sense on its own, as it shows no x- or y-axis legends. Therefore, we uploaded the data (udata.csv) to an online tool called plot.ly, which produced the following beautiful scatter plot:
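The header-skipping trick with tail -n +2 can be verified on a toy file (the numbers below are invented):

```shell
# A file whose first line is a header row:
printf '%s\n' 'tuition,rank' '45000,10' '38000,25' > with_header.csv

# tail -n +2 prints everything starting from line 2, dropping the header:
tail -n +2 with_header.csv
# 45000,10
# 38000,25
```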
It is plain to see from the plot above that highly ranked universities have higher tuition fees! However, the scatter plot also depicts one university (Brigham Young University — Provo) that had a higher rank (rank = 68) with extremely low tuition fees ($5,300 USD per year). Is this an anomaly (outlier) in the data set? We leave that question for you to investigate further!
In this project we have learned to use some important bash commands, such as cut, sort and uniq, in the context of mining a CSV-formatted toy data set consisting of rankings of US academic institutes.
[This project is a part of the ‘Learn to Analyze Data in Bash Shell and Linux’ course.]