Does your organisation’s data reflect the world ‘as it is’ or ‘as it should be’?

Jul 14, 2018

We make decisions based on our understanding of the world, and that’s shaped by how our world is modelled back to us. That’s why data modelling conversations are always interesting, revealing and, sometimes, intense. How we decide what to capture and what to leave out isn’t a neutral process: it’s determined by the power dynamics at play between stakeholders with different experiences, priorities and worldviews.

Obviously, organisations contain a multiplicity of all three of these things. Having said that, I find that people generally take one of two broad positions when it comes to how they model the world they operate in: (1) the world ‘as it is’ and (2) the world as it ‘should’ be. Their data gathering and handling follows suit.

Further, I’ve observed a strong correlation between people’s position on modelling and their position within their organisation’s structure. Generally, in my experience, a preference for modelling that reflects the world as it ‘should’ be is highly correlated with holding a ‘strategic’ role. On the other hand, a preference for modelling that reflects the world ‘as it is’ tends to be more common among those in operational roles.

Modelling the world ‘as it is’ versus the world as it ‘should’ be.

By data that reflects the ‘world as it is’ I mean data that describes actual things: people, objects, locations and events (POLEs). It’s usually collected through existing utility-generating activities and services directly related to these POLEs. It’s collected in order to facilitate getting things done rather than for ‘reporting’ purposes. Let’s call this sort of data administrative data.

In contrast, I classify data that underpins a ‘world as it should be’ model as primarily management reporting data. The sort of data that’s usually collected in order to support or inform internal decision making. An example would probably be helpful at this point.

Say there’s a need to model English local authorities. What’s the best way to collect data to do this? The UK government’s Department of Housing and Local Government did so by creating an authoritative list of English local authorities as part of the government’s register programme*. In doing so it took a world-as-it-is approach. Richard Vale (then of the GDS Registers team) blogged about the process here.

The below-the-line comments on the post are interesting as they bring to life the tensions between some users’ world-as-it-should-be needs and the register custodian’s world-as-it-is offer. For example, one prospective user pointed out that a single UK-wide register would be preferable to four separate country registers, as would some information about the hierarchical relationships between local authorities. In response, Paul Downey (also of the Registers team and the person who originally conceptualised this approach to creating registers) explained that:

“It might be a little more complicated having four separate registers, but it’s how government is organised. We don’t so much see this as a problem, rather an advantage as we’re moving data closer to where it’s actually made. Part of the criteria for becoming a custodian is an ability to demonstrate how maintaining the register is part of a business as usual process in which the custodian has authority, e.g. the people drawing up Statutory Instruments to create new local authorities or change their name change the register ahead of the changes being enacted.
… you’re definitely not alone, we proposed having a “parent” field in this register during our discovery, but this was something our custodian was very opposed to, in part because we were unable to find a name which didn’t imply some kind of hierarchy between authorities (in terms of power), which he explained very clearly and far better than I could, does not exist. We don’t want to perpetuate a misnomer, so left the parent field out of the alpha.”

*Full disclosure: I used to lead the Data Infrastructure programme in which the Register team sat.

The world-as-it-is forces a broadening of our perspective; the world-as-it-should-be necessarily narrows it

The prospective user’s comments reflect a world-as-it-should-be view of data and that’s valid. After all, the point of all this data is to build stuff, make decisions etc., y’know, put it to use. Administrative data, of which the local authority register is an example, reflects the messiness of the real-life data collection process and the constraints those collecting it (in this case the custodian) are under. There is a good degree of pragmatism in their approach: they balance the cost of collecting data against the value derived from its use. The thing is, prospective users have a wide range of very specific needs. In order to maximise the value derived from the data, those collecting it need to serve as wide a range of user needs as possible. This is why it’s so important that senior management doesn’t force the collection of data in a way that meets only their particular world-as-it-should-be model. Senior management is just one of several (re)users of administrative data. The data collected will meet some of their (and other users’) needs really easily. In other cases it won’t, and some extrapolation will be necessary, but this is preferable to burdening data collectors with more work. Why? Well, there are a few reasons but I’ve cited my top two below.

1. There’s no such thing as a free data (collection) lunch

Insight gathering is an expensive business. That’s why, for example, we conduct the census only once a decade. Would it be nice to be able to assess demographic changes on a more frequent basis? Yes, but is it £500 million worth of nice? Arguably not. So much so that the parliamentary Treasury Select Committee raised concerns about the growing cost of conducting the census. ONS and others were subsequently asked to explore alternatives to running a full census; the reuse of administrative data from a variety of sources was one of the options they mooted as a way of bringing down costs.

As mentioned above, administrative data is tied to real-life, utility-providing services, so there are usually some validation processes in place. That means the data quality is relatively high (compared to, say, data scraped from social media sites). However, it’s also worth noting that administrative data reflects the needs of its primary use case. It’s only cheaper than survey data because the reuser isn’t having to pay for customising it. Customisation would require those generating the data to take on more work. This would either distract from their core function or require a boost in their numbers, thus driving up costs.

2. The world is messy, pretending otherwise makes our experience of it more unpredictable and risky

In an old blog post about a shortcoming of the (then new) GOV.UK website and its highly ordered, standardised approach to publishing, Jeni Tennison noted:

But if there’s one thing that the last five long, hard years working with legislation has taught me, it’s that in any vaguely interesting domain, this search for order will always fall down in the face of reality.

I don’t agree with all of Jeni’s points in the post but I think the one about the messiness of complex domains, and the hazards of trying to abstract that away, is spot on. Oversanitised data hinders real understanding and that harms decision-making. George E.P. Box noted that “all models are wrong but some are useful”. He was, of course, right. Sanitised ‘world as it should be’ reporting data, with no link back to the messy ‘world as it is’ data from which it was derived, obscures the truth of Box’s aphorism. In the worst cases, it infuses decision makers with a false confidence and entrenches the sort of ‘God complex’ that Tim Harford mentions in his TED talk. He illustrates this point by referencing the graphs that Hans Rosling, the famous physician and statistician, often produced. They were beautiful and accessible but also an oversimplification. And that’s the thing: with this sort of data, even decision makers who aren’t afflicted with a ‘God complex’ will make poor(er) decisions. The data robs them of any sense of the scale of complexity in the world they’re required to make decisions about.

Tim Harford video about Trial, Error and the God complex

It always comes down to infrastructure in the end, doesn’t it?

I’ve emphasised the differences between data that serves world-as-it-is and world-as-it-should-be models in quite stark terms. I’ve also argued that the former is critical. In so doing I may have given the impression that organisations need to choose one or the other. But the truth is that both models are necessary for effectively running an organisation. The point I really want to make is that re-using administrative data and extrapolating from there is good practice.

This sort of reuse and extrapolation requires a rigorous and systematic approach to analytics. So organisations need to invest in infrastructure capable of supporting it. That includes automating the collection and processing of administrative data in consistent ways that ensure reproducibility and keep down costs. In order to support sensible extrapolations, it must also be able to capture useful data about the context in which the administrative data has been collected.
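To make that a little more concrete, here is a minimal sketch of what "reproducible, with context attached" could look like in practice. Everything in it is an assumption made for illustration: the `Context` fields, the `category` column and the sample rows are invented, and a real pipeline would be far richer. The idea is simply that a derived figure carries a fingerprint of the raw rows it came from, plus a note on how and why those rows were collected.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Context:
    """Hypothetical description of how and why the raw data was collected."""
    source: str
    collected_for: str
    known_gaps: str


def fingerprint(rows):
    """Stable SHA-256 of the input rows, so a derived figure can be traced back."""
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def summarise(rows, context):
    """Count rows per category and keep the collection context with the result."""
    counts = {}
    for row in rows:
        counts[row["category"]] = counts.get(row["category"], 0) + 1
    return {
        "counts": counts,
        "row_count": len(rows),
        "input_sha256": fingerprint(rows),
        "context": context,
    }


# Invented sample rows standing in for an administrative extract.
ROWS = [
    {"id": 1, "category": "renewal"},
    {"id": 2, "category": "new"},
    {"id": 3, "category": "renewal"},
]

CONTEXT = Context(
    source="hypothetical licensing service",
    collected_for="processing applications, not reporting",
    known_gaps="paper applications are keyed in later",
)

summary = summarise(ROWS, CONTEXT)
```

Running `summarise` twice over the same rows gives the same fingerprint, which is the reproducibility part. Note that the fingerprint depends on row order, so a real pipeline would need to decide whether order is meaningful. The `known_gaps` note is there for whoever later extrapolates from the counts.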

Solid infrastructure also makes it possible to build tools that make it easier for a broad range of internal and external users to put administrative data to use. For example, DWP’s data visualisation tool, Churchill, reuses data from ONS’s NOMIS (among several sources), which provides a mix of survey and administrative data. And although Churchill doesn’t (yet?) use GOV.UK registers data, this is the sort of use suggested by the latter’s documentation:

“Registers only provide raw data. You cannot use the APIs to search a register or match data. Depending on your requirements, you may have to build indexes on top of a register to fulfill specific requests.”
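As a rough illustration of what "building indexes on top of a register" might involve, here is a small sketch. It doesn’t call the Registers API at all: the records are invented sample data held in memory, and the field names (`key`, `name`, `type`, `end_date`) are my own assumptions rather than the real register schema. The custodian’s raw list stays as it is, and the reuser layers their own lookups on top.

```python
from collections import defaultdict

# Invented sample records standing in for a register download.
# Field names and values are illustrative assumptions, not real register entries.
RECORDS = [
    {"key": "AAA", "name": "Exampleshire", "type": "CC", "end_date": None},
    {"key": "BBB", "name": "Sampleton", "type": "UA", "end_date": None},
    {"key": "CCC", "name": "Old Testford", "type": "UA", "end_date": "2009-03-31"},
    {"key": "DDD", "name": "Demo Vale", "type": "CC", "end_date": None},
]


def build_index(records, field):
    """Group record keys by the value of one field."""
    index = defaultdict(list)
    for record in records:
        index[record[field]].append(record["key"])
    return dict(index)


def build_name_lookup(records):
    """Map lower-cased names to full records for simple exact-match search."""
    return {record["name"].lower(): record for record in records}


def current_only(records):
    """Keep records with no end date, i.e. entries still in force."""
    return [record for record in records if record["end_date"] is None]


# The reuser's own views, derived from (not written back to) the raw records.
by_type = build_index(current_only(RECORDS), "type")
by_name = build_name_lookup(RECORDS)
```

With these in place, `by_type["CC"]` lists the current entries of one type and `by_name["sampleton"]` finds a record by name. The point of the design is that the extra structure lives with the person who needs it, so the custodian isn’t asked to take on more work.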

The above examples are from the public sector but this applies to private sector organisations too. In a previous post I wrote about how DeepMind’s energy saving algorithm was built by reusing historical sensor data from Google’s data centres as training data. The upshot is that Google shaved (a significant) 40% off its data centre cooling bill. That’s not a bad return on its investment in solid administrative data and the infrastructure to make it accessible and re-usable. It’s worth pointing out that the investment came first.