Open Data Scotland (6th and 7th November) — Team Data
This blog post has been compiled by the team; I’m just the publisher!
Have you ever analysed data without full knowledge of all of the original data sources and how they mesh together? Or have you ever meshed together several disparate and messy datasets? I know I have, and often it’s a time-consuming procedure that requires a lot of homework and, sometimes, some very big judgement calls. At the “Open Data for Scotland” event last weekend, we worked on the process of combining several sources of input data into one new open data set. We think that in these data collation efforts, documenting your process and making that documentation available to others has huge benefits. This could be done in the same way that an analysis or a data collection would be documented. In addition, one could leave as a legacy the code with which the data was collated and cleaned. We listed some benefits of this:
Other users, analysts or consumers can:
· Re-run the process later with updated data
· Change the process and re-run it later if they think there is a better way of analysing/combining the data, or if the input data has changed slightly
· Trace items in the result data back to the original source data
· Check assumptions and decisions made
· Take analysis from one area and re-run in another area (e.g. geographical)
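As a hypothetical sketch of the kind of legacy code we have in mind (the dataset names and fields below are invented for illustration, not from the event), a collation script that tags each output record with its source makes the first benefits above concrete: it can be re-run on updated inputs, amended later, and its results traced back to the originals.

```python
# Sketch of a re-runnable collation script. The source names and
# fields ("council_spending_2015", "area", "amount") are invented
# examples; the point is that every output record carries provenance.

def collate(council_rows, survey_rows):
    """Combine records from two sources, recording where each came from."""
    output = []
    for row in council_rows:
        output.append({**row, "source": "council_spending_2015"})
    for row in survey_rows:
        output.append({**row, "source": "community_survey_2015"})
    return output

# Invented sample input data.
council = [{"area": "Leith", "amount": 1200}]
survey = [{"area": "Leith", "amount": 950}]

for record in collate(council, survey):
    print(record)
```

Because the combination lives in a script rather than in a spreadsheet session, re-running it with next year’s input files reproduces the whole process.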
Without this documentation, someone looking at a data set has no way of judging its accuracy, knowing which comparisons are valid, or knowing what problems the data might have.
We designed the structure of a form or checklist for when you are performing data collation. This can act as a generic guideline to aid in the process and documentation of data collation.
What is your intention in doing this data collation?
For each data set:
What is it?
What license is it? Open or closed?
In what format?
Valid from/to dates? How often is it refreshed?
Describe each field
What do any codes in the data mean? Do definitions differ across sources? What is the data taxonomy?
How do you judge the reliability of the source sets?
What is the range of error in the source sets?
Was the input data filtered?
Which fields in the source data map to fields in the output data?
What rules for mapping/combining data fields were used?
What rules were used for mapping two items in the input data sources to one item in the output data set?
If there are any conflicts in your input sources, how is that shown in the results? What rules were used to handle them?
Do any fields in the results data set show the source of that data item in the input sets?
Has any data been hand edited?
How have you verified the output?
What is the range of errors in the result?
Valid from/to dates? How often is output data refreshed?
For each output format:
Are there any limitations?
For each software / code used:
What is it?
License? Open or closed?
Any license / IP restrictions in output data?
Does data meet the intention?
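One way to make a checklist like this usable in practice — a sketch only; the field names below are our invention, not any agreed standard — is to keep the answers as a structured record published alongside the output data:

```python
# Sketch of the checklist answers as a structured record that could be
# published next to the output data set. Every key and value here is an
# invented example for illustration; there is no standard schema.

collation_record = {
    "intention": "Combine spending and survey data by council area",
    "sources": [
        {
            "name": "council_spending_2015",   # invented example source
            "license": "Open Government Licence",
            "format": "CSV",
            "valid_from": "2015-01-01",
            "valid_to": "2015-12-31",
            "refresh": "annual",
            "fields": {"area": "Council area name",
                       "amount": "Spend in GBP"},
            "filtered": False,
        },
    ],
    "mapping_rules": ["area -> area (exact string match)"],
    "conflict_rules": "Keep both records, tagged with their source",
    "hand_edits": [],
    "verification": "Totals cross-checked against published accounts",
    "error_range": "Unknown; inherited from sources",
    "software": [{"name": "Python 3", "license": "open"}],
    "output_license": "Open Government Licence",
}
```

Kept under version control next to the collation code, a record like this lets a later consumer answer most of the questions above without contacting the original team.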
This is not a prescriptive set — some questions may be more or less relevant depending on what you are doing. These are guidelines to make sure a wide range of areas has been covered.
These are also, mostly, questions that do not have a right or wrong answer. The point is not to award a pass or fail at the end, but to document the relevant information and make it available so others can judge later. For instance, closed-source software may have been used in the analysis; for some people this is fine, while for others it may be a problem. Either way, the point is to make sure it is documented.
When thinking about these questions, there are two types of user to consider:
· A person filling them out as they work on a data set.
· A person who wants to consume a data set and wants to know more about it in order to judge it or reuse it in their own work.
Something like this process of publishing the methodology happens in other areas, such as scientific papers, medical meta-analyses, and social science analysis. Just after the hackathon, we found this blog post talking about the same issue of data processing in science: https://theconversation.com/how-computers-broke-science-and-what-we-can-do-to-fix-it-49938
Overall, we agreed that this is a useful endeavour that would probably aid collation and interpretation after the fact. This is by no means a completed piece of work, so if you are reading this and have comments or thoughts of your own about how to collate messy data in a reproducible and documented way, get in touch!
Roman Popat, James Baster, Iain Paton, Elena Dumitrana