Back to school working on data

Here I am back in The Sensible Code Company offices after four years away. My first two years were spent on machine tool software at Autodesk, which is the antithesis of a tech company startup (don’t get me started). The second two were at the DoESLiverpool hack-space, teaching myself internet-of-things type things.

To catch up with the changes (including the name change from ScraperWiki), I begin with what I used to be good at. I hope to broaden out from there.

So… the latest thing in the news-cycle is government schools policy, with the new Prime Minister Mrs May quoted in the Daily Mail decrying the problem of “school selection by house prices” where parents bid up the cost of property in the catchment areas of desirable schools to unaffordable levels.

I don’t know why anyone is surprised by this outcome, or why the government thinks it’s a problem. To create this phenomenon, the Department for Education has had to invest massively in data creation (aka standardized testing) and in open data technology, with its interactive website www.compare-school-performance.service.gov.uk, making it easy for millions of families to rate schools against one another using the “hard numbers” and all come to the same conclusion about which schools are desirable.

It’s the same underlying problem as with financial advice, which can’t sensibly be provided without a random number generator: if everyone gets the same “best” advice, there will be a bubble and then a crash.

It’s not like health advice, where it’s okay if everyone gets the same best cure for malaria, for instance.

We could, of course, choose to operate an educational system like the highway system, where transport models predict demand and how it will reshape in response to a reasonable road construction strategy, and where traffic collision data systematically guides investment and re-planning around problem areas to make them safe, rather than being dressed up as public advice to scare people away from dangerous places with the fear of death.

But, whatever.

Ignoring these scruples, I worked out from the website that I could download the 2014–2015 schools data using this wonderfully complex link:

https://www.compare-school-performance.service.gov.uk/download-data?download=true&regions=0&filters=SPINE,KS2,KS4,KS4PROV,KS4UNDERLYING,KS5,KS5MINSTANDARDS,VAQUAL,KS5UNDERLYING,VASUBJ,DESTINATION,PUPDESTPROV,PUPILABSENCE,CENSUS,WORKFORCE,SPENDPERPUPIL,SPENDPERPUPILGROUPED,SPENDPERPUPILFULL&fileformat=csv&year=2014-2015&meta=false

This gives a 56MB zip file of CSV files. It’s a shame it’s not just a single SQLite database instead, where everything would be type-checked, indexed, interlinked and joinable immediately out of the box, but this concept is not even on the table.
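Before committing to the full load, it’s easy enough to peek at what’s inside the archive. A minimal sketch (the exact file names and sizes are simply whatever the Department publishes for that year):

import io, requests, zipfile

# the full download link from above (truncated here for readability)
url = "https://www.compare-school-performance.service.gov.uk/download-data?download=true&..."

r = requests.get(url)
z = zipfile.ZipFile(io.BytesIO(r.content))

# list each CSV in the archive with its uncompressed size
for info in z.infolist():
    if info.filename.endswith(".csv"):
        print("%-60s %6.1fMB" % (info.filename, info.file_size / 1e6))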

After about a day of messing around in QuickCode, including a painful hour on a Python 2.7 Unicode codec problem which I could have lived without, I did my thing with the following code:

import scraperwiki, requests, zipfile, io, csv, os

# the full download link from above (truncated here)
url = "https://www.compare-school-performance.service.gov.uk…"
r = requests.get(url)
z = zipfile.ZipFile(io.BytesIO(r.content))

for n in z.namelist():
    # one table per CSV file, named after the file
    table_name, ext = os.path.splitext(n)
    if ext != ".csv":
        continue
    klines = list(csv.reader(io.TextIOWrapper(z.open(n, "r"))))
    headers = klines[0]
    iURN = headers.index("URN")
    # drop rows with no URN and the national summary rows ("NAT")
    lines = [l for l in klines[1:] if l[iURN] and l[iURN] != "NAT"]
    data = [dict(zip(headers, line)) for line in lines]
    scraperwiki.sql.save(["URN"], data, table_name=table_name)

This yielded 11 tables with up to 366 columns each, joinable on the URN (the school’s unique reference number).
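To see what actually landed, you can ask SQLite itself. A minimal sketch, assuming the scraper above wrote into the library’s default scraperwiki.sqlite file (the filename is my assumption about the QuickCode environment):

import sqlite3

# assumed default database file written by scraperwiki.sql.save
conn = sqlite3.connect("scraperwiki.sqlite")

# list every table, count its columns, and check it carries a URN column
for (table,) in conn.execute("SELECT name FROM sqlite_master WHERE type='table'"):
    cols = [row[1] for row in conn.execute('PRAGMA table_info("%s")' % table)]
    print(table, len(cols), "columns,", "has URN" if "URN" in cols else "no URN")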

And then you can plot some of the fields from the tables that might mean something in isolation, such as “2014–2015/england_swf”.SALARY (average gross salary of all teachers in school):
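A minimal sketch of how that plot comes out of the database, using pandas and matplotlib, and assuming the same scraperwiki.sqlite filename, a plain-hyphen table name matching the zip’s directory, and that SALARY arrives as text needing coercion to numbers (all assumptions on my part):

import sqlite3
import pandas as pd
import matplotlib.pyplot as plt

conn = sqlite3.connect("scraperwiki.sqlite")  # assumed database filename

# pull the average gross salary column out of the school workforce table
df = pd.read_sql_query('SELECT URN, SALARY FROM "2014-2015/england_swf"', conn)

# the CSVs store everything as text, so coerce to numbers and drop the blanks
df["SALARY"] = pd.to_numeric(df["SALARY"], errors="coerce")
df = df.dropna(subset=["SALARY"])

df["SALARY"].hist(bins=50)
plt.xlabel("Average gross salary of all teachers (GBP)")
plt.ylabel("Number of schools")
plt.show()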

I had plans to graph the school budget against its academic performance indices using complex joined SQL queries (it’s been four years since I’ve written anything of this kind), but then I ran out of time and mind-focus before I waded into this thorny briar patch.
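For the record, the shape of the query I had in mind is something like the following; the table and column names for spending and attainment (england_spend, TOTALSPENDPERPUPIL, ATTAINMENTSCORE and so on) are placeholders I haven’t checked against the actual download, so treat them as hypothetical:

import sqlite3
import pandas as pd
import matplotlib.pyplot as plt

conn = sqlite3.connect("scraperwiki.sqlite")  # assumed database filename

# join a per-pupil spending figure to a KS4 attainment measure on URN
query = '''
    SELECT s.URN,
           s.TOTALSPENDPERPUPIL AS spend,      -- hypothetical column name
           k.ATTAINMENTSCORE AS attainment     -- hypothetical column name
    FROM "2014-2015/england_spend" AS s        -- hypothetical table name
    JOIN "2014-2015/england_ks4final" AS k     -- hypothetical table name
      ON k.URN = s.URN
'''
df = pd.read_sql_query(query, conn)

# everything comes in as text, so coerce to numbers and drop incomplete rows
for col in ("spend", "attainment"):
    df[col] = pd.to_numeric(df[col], errors="coerce")
df = df.dropna()

df.plot.scatter(x="spend", y="attainment")
plt.show()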

And this, I see, is the main issue with the practice of data science.

There is no way you can do anything until you have assembled and packaged up the data. So the job of data acquisition gets the top-most #1 priority.

As a consequence the job of actually analysing the data gets a lower priority.

Then, what do you know, we don’t have time to get everything done, do we?

But I don’t feel so bad with myself, because at least I’ve managed to tick off the highest priority job on my list.

Even though, without doing priority #2, my time has been about as productively spent as Rimmer from Red Dwarf with his engineering revision timetable. I have therefore failed my first exam.
