Investigating crime data at small geographies
A bit of background (skip if you’re familiar with h-map)
I’ve been doing some work with Herefordshire charity the Bulmer Foundation (and a lot of other organisations) to create a central resource of data about how sustainable Herefordshire is becoming. The Foundation has been working on this for some time and, following a lot of consultation, they have identified the key indicators of sustainability for the county.
If these things are getting better, Herefordshire is developing more sustainably. If these things are getting worse Herefordshire is developing less sustainably.
Identifying the indicators is important but it’s not the whole task. Next we have identify the data that tells us if things are improving or not for each indicator. Then we have to display it in a useful way.
The Foundation has commissioned some development work which is still in progress but play-about-with-able at h-map.org.uk.
Understanding how things are changing at the whole county level might be interesting but for most practical purposes it may not be useful. We need to get into the data at much smaller geographic levels. As a way of investigating how this might work I decided to investigate crime data which is a nice, rich dataset.
This post is really a set of notes about how I processed the data. Partly for my records but also to encourage les autres and to enable people to suggest ways this could be done better.
The police.uk website publishes crime data in a number of ways. There is an API but I haven’t explored that. Humans can download see some data on maps and they can elect to download some CSVs
I asked for everything they had on West Mercia ( the police force serving Herefordshire, Worcestershire, Shropshire and Telford). I received a zip file containing a bunch of csvs. One csv per month for all of the months starting Jan 2011 and ending Decemb 2014. Each csv contains a row for each crime recorded across the force area. Against each row is a type of crime, lat, long, outcome and sundry other data including the LSOA.
That’s great but much more than I need. All I ultimately want to know is the number of crimes recorded per year in each LSOA. If I had better coding/scripting skills I could probably have got a few lines of Python working on the task for me. But I don’t.
Also it was an excuse to play with Datagraft.net This new cloud thingy has sprung from the Dapaas project involving the ODI and other cool linked/open data folk. It allows you to create transformations in a reasonably point and click manner. There is a pipeline metaphor so transformations proceed sequentially. There is a bit of a learning curve and the documentation is early stage but I got the hang of it. It allows you to preview the effect of your transformation on real data which enables effective pressing of buttons until the right thing happens.
So, after a bit, I managed to build a little transformation that pulls out rows here the name of the LSOA contains “Herefordshire” and then creates a 2 column table from this with LSOA and month the respective columns.
I still have 1 csv for each month. It might be that datagraft will append rows onto an existing column but I couldn’t work out how to do this.
So I had a happy couple of hours uploading each csv and downloading the processed file.
What I was aiming for was number of crimes per year per LSOA. So I had to get my monthly files into 1 big yearly file. Which I did manually with the assistance of the excellent TextWrangler application. It really made this tedious manual task a breeze.
Then a simple pivot table for each year gives me the totals I was looking for.
There was a little bit of spreadsheeting to decide if crime was improving, worsening or not changing.
And finally the application of Google Fusion Tables to link the LSOA codes in my dataset to polygons (described as KML) and I have a lovely map painted red, amber and green.
Datagraft enables me to save the transformation so when all of 2015’s data becomes available I’ll be able to repeat the process. It also enables me to to publish my dataset and to RDF it.
Maybe next week for that.
If you have any suggestions for ways I could cut out steps, or improve my data wrangling I would love to hear them.