In January, we (the CauseHub Team) attended FCOHack, a hackathon organised by Rewired State and the Foreign Office to see what kind of useful apps developers could build with its newly published data…and it certainly was a challenge! When data is shared openly, it’s vital that it’s published in a way that people can use and computers can process. I’d like to highlight some of the inadequacies in the datasets we tried to use: simple errors that rendered valuable data all but unusable.
To start with, a major hindrance was that the datasets were missing key columns. In the treaties CSV, for example, all of the really valuable data was merged into a single column. This is a problem from a programmer’s point of view, because computers cannot infer structure the way humans can: to make the different pieces of data in that one column usable, a developer has to write a custom script to pull them apart. This could easily be solved by putting each piece of key data in its own column.
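To make the cost concrete, here is a minimal sketch of the kind of custom script that becomes necessary. The treaty rows shown are hypothetical, and the assumption that the fields are separated by `" - "` is invented for illustration; the point is that none of this code would be needed if each field had its own column.

```python
import csv
import io

# Hypothetical example: all the useful fields arrive mashed into one column,
# so a custom script is needed just to split them back apart.
raw = io.StringIO(
    "Treaty\n"
    '"Entente Cordiale - France - United Kingdom - 1904"\n'
    '"Treaty of Windsor - Portugal - England - 1386"\n'
)

rows = []
for (merged,) in csv.reader(raw):
    if merged == "Treaty":
        continue  # skip the header row
    # Split the single merged column back into the fields it should have been.
    name, country_a, country_b, year = (part.strip() for part in merged.split(" - "))
    rows.append({"name": name, "parties": (country_a, country_b), "year": int(year)})

print(rows[0]["name"])   # Entente Cordiale
print(rows[1]["year"])   # 1386
```

And this is the fragile case: the script silently breaks as soon as a treaty name itself contains the separator.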
To get around this, we had to go through the long process of scraping all 15,000+ treaty records from petitions.gov.uk, which ended up taking us the whole of the first day. It defeats the point of releasing a dataset if a developer is forced to scrape the data anyway.
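For a flavour of what that day of scraping looks like, here is a toy sketch using Python’s standard-library `html.parser`. The page fragment and its markup are entirely hypothetical (the real site’s HTML was different); it only illustrates that the developer ends up reconstructing structure from markup instead of reading a clean dataset.

```python
from html.parser import HTMLParser

# Hypothetical page fragment: in practice we fetched thousands of pages and
# pulled the treaty titles out of markup like this.
PAGE = """
<ul class="treaties">
  <li><a href="/treaty/1">Entente Cordiale</a></li>
  <li><a href="/treaty/2">Treaty of Windsor</a></li>
</ul>
"""

class TreatyScraper(HTMLParser):
    """Collects the text of every link on the page."""

    def __init__(self):
        super().__init__()
        self.in_link = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link and data.strip():
            self.titles.append(data.strip())

scraper = TreatyScraper()
scraper.feed(PAGE)
print(scraper.titles)  # ['Entente Cordiale', 'Treaty of Windsor']
```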
Another problem was that the United States was referred to in three different ways across the datasets (“United States”, “The United States” and “The United States of America”). One of the other major rules to keep in mind when publishing public data is to keep the data consistent! This matters especially in big datasets, because inconsistent names make cross-referencing data very hard. It is also not difficult to get right, thanks to existing standards like the ISO 3166 country codes (alpha-2 and alpha-3).
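A minimal sketch of the cleanup this forces on a developer: a hand-built alias table (hypothetical, but covering exactly the three spellings we saw) mapping each variant back to a single ISO 3166-1 alpha-3 code.

```python
# Hypothetical normalisation table: every spelling a dataset used,
# mapped to one canonical ISO 3166-1 alpha-3 code.
ALIASES = {
    "united states": "USA",
    "the united states": "USA",
    "the united states of america": "USA",
    "united kingdom": "GBR",
}

def to_iso3(name: str) -> str:
    """Return the ISO 3166-1 alpha-3 code for a known country name."""
    return ALIASES[name.strip().lower()]

# All three variants now cross-reference cleanly:
codes = {to_iso3(n) for n in ["United States",
                              "The United States",
                              "The United States of America"]}
print(codes)  # {'USA'}
```

If the datasets had used the ISO codes in the first place, the alias table would not exist at all.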
The final point I would like to make is: don’t litter data with natural language. This goes back to the first point, about making your datasets as easy to use as possible. Imagine a column in a treaties dataset that contained “[Treaty name] signed between [Country 1] and [Country 2]”. This is a perfect example of “littering” data with natural language, because there is no need for the “signed between” and the “and” between the key pieces of data. Each of those values should instead have its own column.
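To show the cost, here is a sketch of the regular expression a developer would have to write to undo that littering. The field value is hypothetical, matching the shape described above; the pattern itself breaks on any treaty whose name happens to contain the word “and” followed by more text in the right place, which is exactly why connecting words don’t belong in data.

```python
import re

# Hypothetical "littered" field, exactly the shape described above.
field = "Entente Cordiale signed between France and United Kingdom"

# A custom pattern is needed just to strip out the connecting words; with
# separate columns, none of this code would exist.
match = re.fullmatch(r"(?P<name>.+) signed between (?P<a>.+) and (?P<b>.+)", field)
name, country_a, country_b = match.group("name", "a", "b")

print(name)       # Entente Cordiale
print(country_b)  # United Kingdom
```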
I would like to finish with how important it is to make data machine readable. If data isn’t easy for a computer to process, there is very little point in releasing it. Data that isn’t machine readable is only useful to humans, and they might as well just use the “user friendly” version online instead.
All data should be published in accordance with these three golden rules for making public data useful:
1) Keep key data separated in different columns. As mentioned above, this is very important because you want to keep your data as machine readable as possible.
2) Make sure that all data and column names are uniform. This increases the number of people who will use the data you provide: uniformity reduces the extra work a developer has to do to use your data in their app.
3) Don’t litter data with natural language. Keep your datasets as free as possible of connecting words like “and”. These words are completely unnecessary and just make it harder for a developer to use your data seamlessly. Remember, they can always be added back at the implementation stage to make the data read naturally again.
It was great to see that the Foreign and Commonwealth Office is taking steps to make their data more openly available — it improves citizens’ ability to innovate and better understand the activity of the state. However, there clearly needs to be a deeper change in the way that the data is created and managed.