How much does your professor make?

Eduardo Poleo
Influitive Crafters
3 min readJul 24, 2016

At Influitive a great deal of our app dumps JSON data at certain endpoints; this data is then processed and integrated with React to generate highly interactive views. In this and a subsequent post I am going to show how we can use a similar process to generate interactive graphics using rails and the D3 library.

The data that we are going to use corresponds to the Public Salary Disclosure of Universities 2014 for Ontario. In this first post we are going to gather this data by means of web-scraping and dump it as JSON in different endpoints. In the following post we'll use the D3 library to generate dynamic web plots out of this data.

Mechanize, Rails and Web-Scraping

We are going to use the Mechanize gem to gather our the data, but first we need to take a close look at the DOM structure we are about to scrape. Fortunately, the people of the Ontario government have created a nicely structured page that we can easily scrape. The snippet below shows the parts we really care about:

There are few key things we need to note in here:

  • The data is paginated which means that we will need to extract the number of pagination links and iterate over them to obtain the data from each page.
  • Each row within the <tbody> maps to an individual’s information although the structure is not 100% consistent because the university and the title are wrapped inside span tags. So we will need to account for that in our script.
  • Finally, since we want to store all the data in our database, every time we iterate over a full row we probably want to create a Staff record which is a Rails active record model.

With this in mind we can proceed and write the following web scraper.

**NOTE**: I am using the xpath notation to dig into the DOM structure. Just if you were wondering what things like "//thead/tr/td[2]/a" were.

For the purpose of this study we want to calculate the average staff earning per university. Now if we take a closer look to our data we can see that the individuals that compose our records can generally be divided into two different groups: academic (professors) and administrative (non-professors). So we want to consider at least three different types of averages:

  • overall_salaries: Professors + Administrative
  • professors_only: Professors
  • administrative_only: Administrative

Ideally, we want the data for each type of average being dumped as JSON into their own specific endpoint, so that users can easily filter the information they want to see. We could make the average calculations on the fly every time the user decides to switch endpoints. But this will either require a complex query or looping through all the ```Staff``` records every time, both of which can be cumbersome and slow.

Instead, we could do these calculations right after finishing scrapping the data and have it ready available (one simple query away) for when the user decides to request it. We can then write the following:

Finally, all we need to do is to create three different end-points on the AverageController to dump the information into. We should also serialize the data so that we have an "easy to work with" format when we deal with in the front-end.

That's it for now the full code for the project can be found here. On my next post I am going to show how to integrate D3 with rails and how to use the data we have gathered here to generate awesome web plots.

Hope you enjoyed the reading. Happy Coding!

--

--