I have spent most of my career working with other people’s data. Across all the clients and projects I’ve encountered, the intention is universal and well-meaning — clients want to make their data useful.
Some want to make it useful for their enterprise, others want to extend the utility they perceive to the community at large via data portals and open data. The universal challenge is that utility is in the eye of the beholder and even significant investments do not seem to guarantee utility.
To illustrate the point about data utility, the graph below shows that interest in Open Data has gone largely unchanged over the past decade. I believe interest is a reasonable proxy for utility, so I think it’s safe to say that we haven’t collectively made all that much progress on making data useful, particularly in an open data sense. I will attempt to summarize the key things I’ve learned via trial, error, and research.
The big focus in data portals has been on the cataloguing systems rather than the data itself. Some great work has been done (CKAN, CSW, and many others) that enables users to find “data” just about any way imaginable.
Users can search, filter, and browse by maps, associations, and tags across interfaces, services, and APIs to find data. The catalogues support a wide variety of file types and even allow APIs and services to be registered and catalogued, enabling data federation. This all sounds good, and its intention is truly noble, but too often it plays out as a really good catalogue of highly variable information.
The Lake Winnipeg Basin Information Network Data HUB illustrates this concept clearly. Well-intentioned data sharing creates a well-organized repository that is not necessarily fit for any purpose or user type. Users will often not self-identify with this type of catalogue because it doesn’t feel like it’s “for them”. Catalogues are important, but from my perspective it’s important to invest at least as much in the data itself and avoid the trap of building an awesome catalogue and then filling it just because you can.
Focus on the highest value data to the user communities you serve.
Avoid diluting that high value data with other data — especially when the other data is available elsewhere from a more authoritative custodian.
I love the idea of Open Data. I feel that if more data were open and usable we could better analyze policy, better understand problems, and ultimately move forward collectively as a society. It should be the best thing ever for data geeks.
Unfortunately, I hate Open Data. While the catalogues do enable me to find promising data easily, it’s painfully common that the “data” is actually nothing but a rendered-down report or, worse, a PDF. For example, the City of Toronto Data Catalogue looks tremendous, with numerous results that might improve the work I am doing on the Municipal Risk Assessment Tool (MRAT). However, when I download the “data” I am disappointed to find a generalized report which I won’t be able to use.
I cannot single out Toronto for doing this wrong, because this is the norm in Open Data. Perhaps they intended to provide something great, but after wading through the privacy and regulatory requirements this was all that could be shared publicly, and with good intentions they shared it. In theory, sharing something is better than not sharing at all. Right?
The difficulty this causes the data community is that users quickly lose trust in Open Data. After wading through an Open Data portal or two, it’s easy to give up on ever finding anything useful and return to chasing data like some sort of inefficient super data detective. I’ve heard of Open Data initiatives losing funding because of a lack of returning users, perhaps without recognizing how those users were underserved and the cycle that creates.
Share good meaningful data, or do not share data at all.
Avoid the tempting middle ground of sharing data that is not useful or complete.
When catalogues enable data stewards to share data in any format, it’s hard to count on that data. Will it look the same when new time-based observations are added? Will it vary year over year? Will I need to update my analysis and models, or is there some stability in the schema? These are all valid questions, and they lead to a lack of confidence.
For example, the Government of Canada recently launched a new Open Data initiative which publishes water monitoring data. I live in Golden, BC, so I was quite curious to see the observations in the Columbia River Basin. I was able to find some interesting data (though without any metadata). I’d like to know: were the samples measured consistently over time? Does each observed property exist at every sample point, or did the sampling parameters change over the past 15 years? These would all be important to know and, in my opinion, represent a flaw in the CKAN data repository approach.
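The kind of schema drift I’m worried about can at least be detected mechanically. Here is a minimal sketch in Python that diffs the column headers of two extracts of the same dataset — note that the file contents, station ID, and column names below are invented for illustration, not the actual Government of Canada files:

```python
import csv
import io

def column_set(csv_text):
    """Return the set of column headers from a CSV extract."""
    reader = csv.reader(io.StringIO(csv_text))
    return set(next(reader))

# Hypothetical extracts of the same monitoring dataset from two years;
# real files would be opened from disk and have different columns.
extract_2005 = "SITE_NO,DATE,PH,TEMP_C\n07EA004,2005-06-01,7.8,11.2\n"
extract_2015 = "SITE_NO,DATE,PH,TURBIDITY_NTU\n07EA004,2015-06-01,7.6,3.4\n"

cols_2005 = column_set(extract_2005)
cols_2015 = column_set(extract_2015)

print("dropped:", sorted(cols_2005 - cols_2015))  # observed properties that vanished
print("added:  ", sorted(cols_2015 - cols_2005))  # ones that appeared later
```

A check like this answers the “did the sampling parameters change?” question in seconds — but only if the publisher ships the data in a consistent tabular form in the first place.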
One of the biggest challenges I have encountered is data without a predictable structure. What does one do when a numeric field recording a series of measurements for an observed property suddenly contains a string holding some manual note about a given record? There are of course many strategies for cleaning this data, ranging from going back to the source (good but costly) to throwing the record(s) out (OK sometimes). There are numerous ways to define and enforce constraints on data structure, but they are rarely applied. We love the JSON Table Schema provided by the frictionlessdata.io community. Knowing the schema helps us know what to do with a given dataset.
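To show what a declared schema buys you, here is a hand-rolled sketch of the idea in plain Python — this is not the frictionlessdata.io library’s actual API, and the field names and the “manual note” record are made up:

```python
# A Table Schema-style field list: each field declares a name and a type,
# mirroring the {"fields": [...]} shape of a JSON Table Schema descriptor.
SCHEMA = [
    {"name": "site_id", "type": "string"},
    {"name": "nitrate_mg_l", "type": "number"},
]

def validate_row(row, schema=SCHEMA):
    """Return a list of error messages for one row (empty list = valid)."""
    errors = []
    for field in schema:
        value = row.get(field["name"])
        if field["type"] == "number":
            try:
                float(value)
            except (TypeError, ValueError):
                errors.append(f"{field['name']}: expected number, got {value!r}")
    return errors

good = {"site_id": "07EA004", "nitrate_mg_l": "0.31"}
bad = {"site_id": "07EA004", "nitrate_mg_l": "sample lost"}  # the manual-note case

print(validate_row(good))  # no errors
print(validate_row(bad))   # flags the stray string in a numeric field
```

With a check like this run at publication time, the stray note gets caught by the steward rather than silently breaking every downstream user’s analysis.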
There are many more characteristics of good and usable data. I will elaborate on the Tesera approach to establishing a data foundation through automated error detection, sanitation, storing, publication, and analysis in the next post on this topic. We’ve developed a number of tools that ensure data end users can quickly use data as they wish and not have to deal with surprises (datafriction).