Seven Silos
One of the broadest problems a data architect tries to solve is connecting users with reliable data, or sometimes just providing them with reliable access to any data.
I feel like once you reach a certain size, things tend towards siloed working in all but the most forward-thinking organisations. You will nearly always find that data is too: contained, secure, and inaccessible to all but a few. Here I try to categorise the most common silos I have found data in, and some of the ways I go about breaking that data out of them safely.
I appreciate that these have a bit of a data in government focus, but I work on data in government, so…
The Metadata Silo
I think we can all agree that this is a stone-cold classic. Missing or confusing field names, missing data types, no indication of the time period the data set covers, and a lack of field definitions are examples of metadata problems.
Metadata as a term can be quite broad; here I am talking about the “hard” characteristics of a dataset, such as those described in vocabularies like Dublin Core and DCAT.
The good news here is that missing, incomplete or incorrect metadata is relatively easy to fix. You pick a vocabulary, you read it, you apply it (or a basic subset of it), et voilà, you have fixed the problem.
What do you mean I glossed over the part where it takes many hundreds of hours of work by many people to get to that point?
There’s no getting around it: if some of your data has missing or incomplete metadata, it is likely that all of your data does. The question you need to ask is “is this an investment worth making?”.
There is an argument that the only piece of metadata you need is an owner, because you can always contact the owner for the other metadata, so always start with that. Identify your datasets, make a list of owners and go from there. As far as metadata goes, the more the better, but it can be a case of diminishing returns. I would say that a good rule of thumb is that any missing metadata that 15 minutes examining the data set and 15 minutes talking to the owner can’t surface should wait for another day.
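As a sketch of what applying “a basic subset” of a vocabulary might look like in practice, here is a hypothetical minimal record using a handful of Dublin Core terms, plus a check for gaps. The choice of terms, the dataset, and all the values are my own illustrative assumptions, not an official profile:

```python
# A hypothetical "basic subset" of Dublin Core terms for a dataset record.
# Note the owner contact comes first - it is the one piece of metadata
# you can use to chase down everything else.
REQUIRED_TERMS = ["title", "description", "creator", "date", "coverage"]

record = {
    "title": "Food hygiene inspection results",
    "description": "Outcomes of hygiene inspections of food businesses.",
    "creator": "inspections-team@example.gov.uk",   # the owner
    "date": "2014-01-01/2014-12-31",                # time period covered
    "coverage": "England, Wales and Northern Ireland",
}

def missing_terms(record, required=REQUIRED_TERMS):
    """Return the required terms a metadata record is missing or leaves blank."""
    return [term for term in required if not record.get(term)]

print(missing_terms(record))  # an empty list means the basic subset is complete
```

A sweep like this over your dataset catalogue is exactly the “identify your datasets, make a list of owners” exercise: records with a missing `creator` go to the top of the to-do list.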
There is another flavour of metadata however, that is not particularly well served by vocabularies alone, but can cause just as much pain if it is missing.
The Expertise Silo
Data in this silo might be readily available to users, but is so complex, or sits within such a specific domain, that only those who work with it on a day-to-day basis can extract any value from it.
This silo is almost impossible to eliminate entirely, and it is quite common within government because policy tends to lead what data is collected and used. That collection is shaped by users with incredibly deep subject matter expertise who may not consider it useful to others¹.
This is again where having a culture of information asset ownership in your organisation will really help (I am forever grateful for the groundwork laid for me at the FSA by our Information Governance team), because it gives users a direct line to someone who can field domain related questions.
The challenge is not in identifying suitable owners, but in bringing them on board with the obligations that come with being an information asset owner. You need to appeal to their better self: that there is a tangible benefit to the organisation (and to them) in the long term, that previously unseen insights might lie in data they had not considered, and that their contribution to understanding that data set is key².
This approach is great for a first pass, but much like documenting institutional memory, you will need to take a longer view. This is where documentation, aka qualitative metadata, comes in.
Good documentation means users can make meaningful judgements about how useful the data will be for their purposes without having to try it first. As a profession, I would very much like to see data architecture make some strides in this area.
What do re-usable patterns for describing the qualitative properties of data look like?
How do we talk clearly about biases and limitations in data sets?
Should qualitative metadata be included in vocabularies like Dublin Core and DCAT?
I don’t know, I would love to hear from others with experience in this area.
The Technical Silo
This data might be held in a team’s shared drive, behind a piece of network infrastructure, or in the reporting suite of a specialist piece of software, only accessible by those with a license or the training to extract it.
Data which exists in well-structured relational databases supporting live services often falls into this category. Data in these silos frequently needs to be extracted manually before being sent onward to be transformed or manipulated by someone else.
Mitigating this comes down to embracing good service design principles. Designing the data in your service, rather than treating it as a by-product, is key.
I would guess that ninety-nine percent of the data stuck in this silo is not from shiny new digital services but from the heap of legacy applications. The question you really need to ask is “what is the cost of breaking down this silo, and what are the potential benefits?”, and then look for the lowest-cost way to get that data out.
Maybe an overnight dump of a table will do it, maybe you want to provide an API. Sometimes you realise that actually the best thing for this silo is to empty it altogether. We need to have more conversations about turning legacy systems off. These conversations will be hard.
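To give a flavour of the “overnight dump” option, here is a minimal sketch assuming a SQL source you are allowed to read from. The database, table, and file paths are all hypothetical, and a real legacy system will need its own driver and credentials; this uses SQLite purely so the example is self-contained:

```python
import csv
import sqlite3

def dump_table(db_path, table, out_path):
    """Export one table to CSV - the simplest way out of a technical silo."""
    conn = sqlite3.connect(db_path)
    try:
        # The table name is interpolated directly, so it must come from
        # configuration you trust, never from user input.
        cursor = conn.execute(f"SELECT * FROM {table}")
        headers = [col[0] for col in cursor.description]
        with open(out_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(headers)  # keep the field names - metadata matters
            writer.writerows(cursor)
    finally:
        conn.close()

# Scheduled overnight, e.g. from cron:
# dump_table("legacy.db", "inspections", "/shared/inspections.csv")
```

It is crude, but a scheduled dump to a shared location is often the cheapest first step; an API can come later if the data proves its worth.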
The Legal Silo
Data which is controlled by a small group of users on the basis of a legal requirement or commercial sensitivity. This kind of silo can be particularly difficult to break down: when you ask for access, the most frequent response is “legally we can’t use or disclose that”, which may be true, but frequently goes unchallenged. Sometimes you have to ask people to show you chapter and verse.
Another way this silo can affect business decisions (especially in government) is where we are collecting or storing data because it is believed that we have an obligation to do so, and that opinion goes unchallenged for so long that it ends up baked into the end-to-end business process. This is an example of a business process decision being around so long it ends up masquerading as a user need.
Users who want the data need to be prepared to work with Information Asset Owners to reach a point where it can be made safe to be brought out of the silo. Balancing this need and keeping the data useful can require considerable negotiation and a lot of asking “what’s the risk of sharing this data?”, field by field.
Internally you might benefit from the approach “if you work for this organisation, you are trusted to use the data it holds safely and effectively”, but this can feel counter-intuitive where you have a strong culture of information asset ownership. Balancing these two points of view is difficult; you should work closely with information governance colleagues to communicate to your organisation that they are complementary, not adversarial.
When designing or redesigning business processes, I like to challenge assumptions based on perceived legal constraints where a significant amount of time has passed or the landscape has changed, even if it just ends up confirming the original position. It doesn’t always work, but the conversation is always enlightening.
The Ownership Silo
This would include data to which your organisation has access, but does not own. This often creates a situation whereby the person with access to the data is reluctant to share it with colleagues because they may not fully understand the implications of doing so.
Where you have a data sharing agreement with another organisation, you should try not to let that agreement impose limits as to what you can do with the data (within reason).
Where you have a specific reason to use data from elsewhere in the course of your job, you should make sure you clearly understand your rights and obligations around that data. In all cases you should ask very critically whether you should hold the data at all.
I am in danger of veering into the realms of technical principles of information governance and data protection here, which I am wholly unqualified to do. So to keep it short: make friends with your information governance team if you don’t already work closely with them. Talk to them about the scenarios where you can see that data obtained for one purpose could be good for another, and between you work out what a baseline for your organisation should look like when you share data with others, or others share data with you.
Alternatively, just use open data where possible, it’s good stuff. I hear there’s a camp people go to talk about it and everything.
The Forgotten Silo
Data which exists entirely within the confines of a personal drive, perhaps gathered as part of an abandoned project, or raw data used in a study for which the results and summary are published but the files are not.
Data in this category often ends up forgotten about because it doesn’t tend to be considered part of records management best practice: the outputs are, but the underlying workings often are not, because they are considered too messy or just part of a process.
Tackling this kind of silo requires a culture that includes actively managing the data which underpins analysis and studies. The good news is that this is relatively easy to tackle and can often be included when you do a sweep to tackle your metadata problem.
The Effort Silo
You don’t even know you’re in this silo until it’s too late. It can take a little from some or all of the other silo types, combining them to create possibly the most dastardly silo of all: effort.
When a new user tries to use this type of dataset, at first all can seem fine, and in some cases they can even successfully use the data for the task they had in mind. This silo frequently manifests itself only after more rigorous analysis has been carried out.
Users may find that once this analysis has started there is something fundamentally wrong with the data that they were unaware of before, or the results they get from using it point to one or more problems. So they give up on using it, or worse they decide that they are now forced into a position where they have to collect the data all over again.
Effort silos can’t really be tackled directly; you can only hope to eliminate them by tackling the others. It is worth considering destroying data which seems to be in this silo, to protect the energy of the next poor sod who might stumble across it in a eureka moment, only to have their hopes and dreams dashed after an enormous amount of labour.
[1] They might not consider the data useful to anyone else, not their expertise. ;-)
[2] This is my preferred method, but sometimes it just straight up doesn’t work. Some people don’t have a better self to appeal to I guess?
