Before the computers ‘take over’…

Ade Adewunmi
6 min read · Jan 15, 2018

--

Cheaper computing and storage capacity mean we can now use relatively data-hungry approaches such as machine learning to glean insights and drive operational efficiency in a way we just couldn't before. There's excitement about AI's potential from varied quarters — from Amazon's Jeff Bezos in his Day One letter to the UK Government in its AI Strategy. That potential extends beyond sophisticated AI and Data Analytics solutions to 'low-hanging fruit' such as Robotic Process Automation, too. That's not to say there aren't concerns* about how ready society is to handle the accompanying disruption. Both the optimistic enthusiasm and the cautious concern are valid, and it makes sense to begin discussing the benefits and risks now. But I think we've a way to go before most companies can derive significant value from all their data, never mind computers 'taking over'. And that's what this post is about.

*NB: Here’s the IPPR report on which that article was based.

Data quantity versus data quality

The blocker I'd particularly like to focus on here is data quality. As stated above, we're now able to use much more data-hungry approaches to Data Analytics and AI, and if current trends persist there will continue to be enough data to support them. However, a lot of current practice won't support the kind of consistency and integrity needed for quality to keep pace with quantity. And this is a problem, because dodgy data yields dodgy insights and dodgy decisions.

There are many historical reasons why data quality isn't as high as we'd like, but what keeps it that way is worth examining. I think it's akin to the productivity paradox driving hand car washes in the UK.

“Perversely, instead of using the high-tech cleaners at garages we choose to pay others to wash our cars using the most inefficient of methods – a hose, bucket, water, soap and sweat. The hand car wash has grabbed around half of the commercial car wash market in the UK.”

The growth of hand car washes has happened despite the fact that automatic car washes are faster, less likely to cause damage and clean better. How do we explain this? Simply put, short-term profitability. If you're setting up a car wash in the UK today, it's more profitable to rent a disused space and employ people who are willing to do, or more likely unable to decline, low-wage work than to shell out for an automated one.

I think that if you're building a digital service you're in a similar position when it comes to cleaning and validating the data users enter. Why take on the extra burden? After all, keeping the data 'clean' (accurate and validated against the freshest, most authoritative sources) and available to others isn't critical for the delivery of digital services. And doing this extra work is unlikely to make a huge difference to the running of the user-facing component of the digital service in the short term. So, why bother? The truth is most teams won't.
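To make that a bit more concrete, here's a minimal sketch (in Python) of what validating a user-submitted field against an authoritative source might look like at the point of collection. The register file, field names and data are hypothetical; the point is only to show the shape of the extra work most teams skip.

```python
import csv

def load_register(path):
    """Load a hypothetical authoritative register (e.g. a list of valid
    organisation codes) into a lookup table keyed by identifier."""
    with open(path, newline="") as f:
        return {row["code"]: row for row in csv.DictReader(f)}

def validate_submission(submission, register):
    """Check a user-submitted record against the register and normalise it,
    rather than storing whatever the user happened to type."""
    code = submission.get("org_code", "").strip().upper()
    if code not in register:
        return None, f"Unknown organisation code: {code!r}"
    # Reuse the register's canonical name instead of the free-text version
    clean = {**submission, "org_code": code, "org_name": register[code]["name"]}
    return clean, None

# Example usage with made-up data
register = load_register("organisations.csv")
record, error = validate_submission({"org_code": " ab123 ", "org_name": "Acme"}, register)
```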

Image created by Rosenfeld Media and released under a CC licence

At present even the government's Digital Service Standard has a very functional, service-tailored view of data. The 3 data-related criteria (15 to 17) are focused on the collection and reporting of performance data for the service being assessed.

Why does this matter?

I can think of two big reasons.

  1. Digital services and processes are the de facto 'data engines' of most organisations. This is because services (especially transactional ones) essentially collect and process data in order to meet users' needs. They also create a lot of useful data as a byproduct. Much of this data is useful to others within the organisation — for the purposes of analytics and running services. A narrow focus on data that is of 'good enough' quality to run the online component of a digital service deprives the wider organisation of a valuable resource.
  2. Cleaning data becomes more costly the further you get from its source and the more of it you have. The cost to the wider organisation of poor-quality data is multiplied accordingly.

‘So far, so reasonable’, is what I hope you’re thinking. I also imagine you’re wondering:

  • what are the obstacles to putting the data that services generate to use?
  • where it has been done, what conditions made it possible?

Good questions. I'm going to try to answer them using an example to illustrate. I hope it's uncontroversial to say that we're a fairly long way off from artificial general intelligence. However, there's been good progress with task-specific AI. DeepMind's energy-saving algorithm, which cut the energy Google uses to cool its data centres by up to 40%, is a good example of this. Here's an outline of their approach:

We accomplished this by taking the historical data that had already been collected by thousands of sensors within the data centre — data such as temperatures, power, pump speeds, setpoints, etc. — and using it to train an ensemble of deep neural networks.

This paragraph answers both of the questions above, directly and indirectly. And even though in this case the data was collected by sensors rather than services, I think the argument holds.
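To ground that in code, below is a toy sketch of the general pattern the quote describes (not DeepMind's actual system): well-defined historical sensor readings go in, and a small ensemble of neural networks that predicts an efficiency metric comes out. The column meanings and the synthetic data are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Stand-in for years of well-defined, consistently collected sensor data:
# columns might be temperatures, power draw, pump speeds, setpoints, etc.
rng = np.random.default_rng(0)
X = rng.normal(size=(5_000, 4))  # sensor readings
y = X @ np.array([0.5, -0.2, 0.1, 0.3]) + rng.normal(scale=0.05, size=5_000)  # efficiency metric

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A small ensemble of neural networks, each trained with a different seed
ensemble = [
    MLPRegressor(hidden_layer_sizes=(32, 32), random_state=seed, max_iter=500).fit(X_train, y_train)
    for seed in range(3)
]

# Average the ensemble's predictions and check them against held-out data
pred = np.mean([m.predict(X_test) for m in ensemble], axis=0)
print("mean absolute error:", np.abs(pred - y_test).mean())
```

None of this works without the unglamorous prior step: someone decided what the sensors would measure, how the readings would be defined, and how they'd be stored so they could be found and reused later.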

Optimal usage of data requires the availability of well-defined, consistently collected, enduring and easily accessible datasets. The main obstacle is that no one is given responsibility for this. Because here's the thing: that 'historical data' is the result of a series of decisions made by someone or, more likely, some team. They picked what they were going to measure and what they wouldn't. They decided the definitions for the things they would measure. And then they made a series of decisions which led to that data being discoverable and usable by others. So when the data scientists and engineers were tasked with reducing Google's energy bill, there was loads of useful data for them to work on.

Does that sound like the experience of most data scientists in your organisation? Probably not. One report based on a survey of data scientists suggests that they spend almost 80% of their time acquiring and cleaning the data they need. This chimes with what my data scientist friends have told me. Given the scarcity of data scientists, this is not a good use of their time or skills.

Digital teams need to think about data in the way they think about user research, i.e. it's everybody's job. Hence my modification (below) of the government's 'User Research is a Team Sport' poster. See what I did there?

Cool poster aside, data design and the decisions that inform it are not trivial matters. They really need the insights that a multidisciplinary team can bring to bear. Mimi Onuoha and Jer Thorp have written thoughtful, excellent posts about that here and here.

Original image by gdsteam under a CC Licence. Image has been modified as allowed by terms of the licence.

Making this happen

None of this happens without good governance. By that I don't really mean boards, although if they're run in line with agile principles they can be very helpful. I mean supporting the kind of good practice that leads to frictionless discovery, querying and reuse of data, and validation against authoritative open datasets (aka registers) and personal datasets (with properly enforced access rules).

This is where things like the Digital Service Standard, which codifies good practice and sets out what 'good' looks like, could really add value. For example, it could strongly recommend that services validate data against authoritative datasets and reuse their identifiers. It could also codify best practice for dealing with well-known issues with identifiers.
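As a sketch of what 'dealing with well-known issues with identifiers' might mean in practice: identifiers in authoritative datasets get retired or superseded over time, so a service reusing them needs to resolve old values to current ones rather than silently storing or rejecting them. The register contents and supersession mapping below are hypothetical.

```python
# Hypothetical register of current identifiers, plus a mapping of superseded ones
CURRENT_IDS = {"GB-LA-001", "GB-LA-002", "GB-LA-004"}
SUPERSEDED = {"GB-LA-003": "GB-LA-004"}  # e.g. local authorities that merged

def resolve_identifier(raw_id):
    """Return the canonical, current identifier for a submitted value,
    following supersession links, or None if it can't be resolved."""
    candidate = raw_id.strip().upper()
    seen = set()
    while candidate in SUPERSEDED and candidate not in seen:
        seen.add(candidate)            # guard against cyclic mappings
        candidate = SUPERSEDED[candidate]
    return candidate if candidate in CURRENT_IDS else None

print(resolve_identifier(" gb-la-003 "))  # -> "GB-LA-004"
print(resolve_identifier("GB-LA-999"))    # -> None
```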

Separately, I remain hopeful that the team behind the government AI strategy will issue some advice on what it means for government services. If it does then I hope it will cover good practice for data design and reuse.

What does designing for data quality and reuse make possible?

It opens the door to more people in the organisation being able to benefit from the service data, given the right tools. This would mean that people's ability to contribute to the business or policy focus of an organisation or department isn't down to their proximity to the service(s) that create the data. This is why I'm so interested in DWP's Churchill tool and its focus on meeting the data needs of policy people. It would be great to know how they're managing data quality and reuse.

If good data is a team sport, how do we play as a team?

I think it starts with more and better conversations. It’s also a good way for me to meet my 2018 resolution.

Ellie Craven and I will be pitching a session to talk about this at the upcoming UKGovCamp 2018. Who’s up for it?

NB: I’d like to thank Ellie Craven for her valuable insights which informed my writing of this post.

