10 steps on how I sourced, edited, cleaned and analysed the data for the project…


Because the data was spread across multiple pages, scraping was not an option. Whilst scraping would have been a much better route, both for time and because it would have been a useful skill to practise, I had to manually input each entry into the data sheet. This starts to raise questions about the accessibility of public data, but that's a question for another blog post.

Today, I wanted to explore how I sourced the data. It may seem a minor point, but the process should have been easy via scraping; instead, I had to adapt and respond to the issues I ran into. This article is also a useful way to explore how not all data is truly open, a key point that we will discuss later.


Step 1: What is already out there?

The first step was to see what data, if any, was already out there. The intention was to find out whether I even needed to source the data myself, or whether I could simply download a pre-filled Excel sheet to save time.

To achieve this, I ran four types of searches. The first was a general Google search asking for a list of academies in Birmingham. This turned up results like the ones you see below; whilst they gave me lists, they neither stated whether a school was an academy nor offered an Excel sheet to download. The second was a Google Advanced Search built around keywords: downloadable Excel sheets, academies, Birmingham, schools in Birmingham. This again returned nothing useful. The third and fourth, searches on WhatDoTheyKnow and the Birmingham City Council website to see if someone had already requested and received a list of academies in Birmingham, were just as unsuccessful.

In the end, I went to the government website to find the information directly.

Step 2: Open up Microsoft Excel

The next step was to open Excel. This may sound like an obvious step, but given the manual nature of the work, I would now have to insert the data by hand. Excel was chosen because it is simple and effective for the job.

Step 3: Find the data online

The next step was to find the right data; after all, I don't want every school in England, just Birmingham. To do this, I first searched for Birmingham under Local Authority, though in hindsight I should have searched by constituency, as that would have made for an interesting data visualisation.

I then selected 'Academy' under school type and selected all three education phases, thus including every school on the list. This gave me a list of 170 schools.

Step 4: Try the comparison list

You may notice the comparison list on the website; my plan was to add every school to it and then scrape that one sheet. However, when I tried this, I found that it missed out several key pieces of information, including the Attainment 8 Score, the % of English-speaking students and several other important elements.

Step 5: Manually transfer the data to Excel

The next step was long and tedious, taking over a week to complete due to work commitments. It was also a revealing part of the project, as it showcased the importance of open data and raised the question: is this data really open if I had to write it out manually?

The sheet was organised simply into several columns covering all the categories the website featured. Some cells were left blank because the category did not apply to that school; for instance, secondary schools don't have a '% Meeting Reading, Writing & English' category.
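To give a sense of the layout, the header row looked something like this (illustrative, drawn from the categories mentioned in this post rather than copied verbatim from my sheet):

  School Name | School Type | Education Phase | Sponsor | Attainment 8 Score | % English-Speaking Students | % Meeting Reading, Writing & English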

Step 6: Find the governors' data

The next step was to find the names of the governors for the second tab in the Excel sheet. Again, this could not be done via scraping, so I had to manually go to each school's website (as the information was not on the government website), find the governors list and copy it into the Excel document.

Step 7: Add the governors' data to Excel

This is what the final product looks like once all the governors had been added. As you can see, some entries are missing; this is because the list of governors was not on those schools' websites.

Step 8: Try and fill in the missing data

To address the gaps, I contacted the schools on the list, asking for either a link to the list on their website or a copy via email. I did not take the formal FOI route because I felt (correctly, as it turned out) that a direct request would be faster: they all responded within 24 hours with the list, whereas an FOI may have taken longer.

Step 9: Add company directors

The next step was to find and add information about company directors to the academy sponsors. To do this, I took the ten most common academy sponsors in Birmingham (as this kept the exercise focused) and used Companies House to find each sponsor's company number. From there, I searched for that number in the Companies House database and copied all of their directors into the Excel sheet.

Step 10: Excel the hell out of the document

The next step was to mess around with the data to find some trends. The first task was to cross-reference the governors list against the Companies House directors list. This was a pretty simple case of using =VLOOKUP. As we will explore in a future post, this came up with no matches.
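As a rough sketch of the kind of formula involved (the tab names and ranges here are illustrative, not the exact ones from my workbook): with governor names in column A of a Governors tab and director names in column A of a Directors tab, this returns the name where a governor also appears on the directors list, and #N/A where they don't:

  =VLOOKUP(A2, Directors!$A$2:$A$500, 1, FALSE)

Wrapping it in =IFERROR(VLOOKUP(…), "No match") makes the no-match rows easier to read at a glance.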

In addition, other basic Excel functions were applied to the sheet (sketched below), including:

  • Ranking the number of schools under each sponsor.
  • Working out the Birmingham average ranking.
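As a sketch (again, the tab and column references are illustrative rather than copied from my workbook): with each sponsor listed once in column D, the sponsor column on the main tab in column C and a score column such as Attainment 8 in column E, the two tasks look like this:

  =COUNTIF(Schools!$C$2:$C$171, D2)
  =AVERAGE(Schools!$E$2:$E$171)

The first counts how many schools sit under the sponsor named in D2; the second gives the Birmingham-wide average of the chosen measure.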
