Talk: The social impact of open data

This is based on a guest lecture I gave to 3rd year data science undergrads at Cardiff Metropolitan University.

The most valuable open dataset in the world?

On 12 January 2010 an earthquake struck Port-au-Prince in Haiti, causing widespread destruction across the capital of this extremely income-poor country. A huge international humanitarian response was initiated. Search and rescue teams flew into the area, followed by specialists in providing shelter, water, sanitation and food. Coordination of these humanitarian agencies was, as it always is, a significant challenge given the damage to the communications infrastructure and the poor quality of maps of the area.

The developing world is, and was in 2010, poorly mapped.

2010 was different though. The crowdsourced mapping database OpenStreetMap was six years old and the humanitarian GIS team MapAction had been using OpenStreetMap data for three years. The infrastructure existed to build a map on the fly.

And that is exactly what happened.

Teams of volunteers across the globe, many of them in universities, worked to improve the mapping data for Haiti. They needed imagery of the ground. For one of the first times in a situation like this, satellite images and aerial photos were released under an open licence.

It’s not clear what difference the map made in this crisis, though it was widely used. Since then, humanitarian agencies have increasingly relied on OpenStreetMap data.

MSF and the British and American Red Cross now run a project to improve OpenStreetMap coverage in specific countries in the developing world.

To humanitarian agencies what matters is that there is a map. The openness of the data is not the most important aspect.

But this map can only be created because imagery is available under open licences and because the volunteers license their work under open licences.

It’s the data that delivers the impact.

It’s the openness that delivers the data.

Data that anyone can access, use or share

So let’s take a moment to check we are talking about the same things.

The Open Data Institute defines open data as “data that anyone can access, use or share”.

Open is not, broadly, a characteristic of the data. A single data point may be open. A vast treasure trove of big data may be open (like GDELT: if you aren’t familiar with it, check it out and lose yourself in a big data rabbit hole for a long time).

It isn’t primarily a government issue either. We’ll see in a minute that it’s easy to make a case for why governments should open their data, but commercial and non-profit data owners can, and I would argue should, open theirs too.

Format is relevant. For data to be something that anyone can access it must, broadly speaking, be available on the internet and in a format that makes it easy to use. So not PDF, at least CSV.
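To see why format matters, here is a minimal Python sketch of what "easy to use" means in practice. The sample rows and the column names `area` and `cases` are made up for illustration. The point is only that a CSV can be read with a few lines of standard library code, which is rarely true of a table locked inside a PDF.

```python
import csv
import io

# Made-up sample standing in for a published CSV file (an assumption for illustration).
SAMPLE_CSV = """area,cases
North,12
South,7
East,0
"""


def total_cases(csv_text: str) -> int:
    """Sum the 'cases' column of a CSV document held in a string."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sum(int(row["cases"]) for row in reader)


if __name__ == "__main__":
    print(total_cases(SAMPLE_CSV))
```

With a real dataset you would open the downloaded file instead of the inline string; the parsing code would stay the same.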

“Open” primarily refers to the permissions the owner of the data gives to others about what they may do with the data.

Can data be a bit open?

When we think about this question of permissions for even a little while we start to understand that there is a spectrum running from closed to open.

I spoke more about this but as we’re on the web I can just point you to what the Open Data Institute has to say about the data spectrum.

The case(s) for open data

Arguments in favour of open data fall into three broad categories.

That open data is a social good.

It empowers citizens relative to organisations and governments and enables them to take part in society more fully. An aspect of this is that it enables citizens, journalists and others to hold organisations and governments to account.

In Western democracies like the UK this is often used as an argument that open data will improve public services. This is often called the “open data auditor”.

In this argument: open data is simply the most useful sort of data.

That open data is good for the economy.

It makes it easier to start viable businesses and promotes innovation and new approaches. Innovation is seen to come both from new perspectives on existing data and from linking different, often disparate datasets.

And, in the case of governments, that citizens of a state already own the data and should be able to freely access it.

This was the basis of the, apparently largely successful, Free Our Data Campaign in the UK (from the Guardian newspaper).

So logically when we look at social impacts we would expect to see benefits in at least one of those areas.

Some case studies

Dengue and Zika are spread in Singapore by the Aedes mosquito. The mosquito breeds in stagnant water and citizens are encouraged to cover or remove any sources of standing water. Since 2005 the government has published a daily update alerting citizens to the location of Dengue outbreaks.

This gives citizens up-to-date information on the risks in their community and prompts them to take appropriate action. It is linked to the on-the-ground efforts of the government’s public health teams to tackle outbreaks. The site is certainly popular, and citizens and journalists regard it as a trusted source of information about the risks across the country.

The data is available via an official website and via an API. Developers have built a range of services to add value to the raw data. For example DengueLah is a smartphone app that alerts the user if they come within 2km of a Dengue outbreak.
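A proximity alert of this kind needs very little logic once the outbreak locations are published as data. The sketch below is an assumption about how such a check could work, not a description of how DengueLah is actually built: the function names, the cluster list and the coordinates are all hypothetical, and it uses the standard haversine formula for great-circle distance.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def nearby_clusters(user, clusters, radius_km=2.0):
    """Return the names of clusters within radius_km of the user's position."""
    user_lat, user_lon = user
    return [
        name
        for name, lat, lon in clusters
        if haversine_km(user_lat, user_lon, lat, lon) <= radius_km
    ]


# Hypothetical positions, invented for illustration only.
USER = (1.3521, 103.8198)
CLUSTERS = [
    ("Cluster A", 1.3600, 103.8250),  # roughly 1 km from USER
    ("Cluster B", 1.4000, 103.9000),  # several km from USER
]

if __name__ == "__main__":
    print(nearby_clusters(USER, CLUSTERS))
```

The app is the easy part. What makes it possible is a trusted, regularly updated list of outbreak locations that anyone is allowed to use.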

Transport for London has been publishing feeds of a wealth of data about its services including timetables and live service status since at least 2010. It builds its own apps and services on these feeds. A thriving ecosystem of third party apps has been created to make use of this data.

Citymapper is an obvious example. It provides a really useful app to get you from one point in London to another via any combination of transport modes.

Citymapper therefore knows a lot about where people want to get to and how easy or not it is for them to do so. It has started to use that insight to develop new bus (and taxi) services.


The GPS programme began in 1973 and the first satellite was launched in 1978. At first the system was restricted to military use. In the 1980s the US government began to allow limited civilian use of the system.

In May 2000 the US government switched off Selective Availability, the deliberate degradation of the civilian signal, which gave civilians access to much the same level of accuracy the US military had. And then the US Government committed to maintaining civilian GPS permanently.

We can all think of countless examples of applications that rely heavily on this global open dataset. Ships, aircraft and private cars now navigate across the globe with their position plotted to within metres at any time.

At its heart it’s a remarkably simple dataset. Essentially a time signal. And yet it allows for incredibly rich applications and services to be built on top of it.

And just one more.

Open datasets and crowdsourcing allow companies to sell insights to farmers in Yemen about where to dig irrigation wells.

There are many, many examples of open data adding value.

But how much of the impact is down to the data and how much to the openness?

This is how I think about that question.

(openness + salience) * capacity of user = impact

Three factors affect the impact of a dataset on the world:

  • how available the data is (openness)
  • how directly relevant it is to a real-world problem (salience)
  • the degree to which people can make use of the data (capacity of user)

Let’s think about the GPS satellite network. When it was originally introduced it had a significant impact but only in the US military. Knowing where you are is important for soldiers, sailors and pilots so it was a very salient dataset. And the military could train people and give them suitable kit.

When they opened the dataset it certainly increased the impact, but really only for specialist users. Civilian sailors and civilian pilots adopted the technology, but it took the advent of consumer electronics and then smartphones to enable the huge uptake we see now.
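The formula can be made concrete with a small Python sketch of the GPS story. The scores are invented for illustration on a 0 to 1 scale, and treating capacity as the share of potential users who are able to use the data is my assumption. Only the shape of the result matters, not the numbers.

```python
def impact(openness: float, salience: float, capacity: float) -> float:
    """(openness + salience) * capacity of user = impact."""
    return (openness + salience) * capacity


# Illustrative, made-up scores: (openness, salience, capacity of user).
GPS_STAGES = {
    "military only": (0.0, 1.0, 0.2),
    "opened, specialist users": (1.0, 1.0, 0.2),
    "opened, smartphones everywhere": (1.0, 1.0, 1.0),
}

if __name__ == "__main__":
    for stage, scores in GPS_STAGES.items():
        print(f"{stage}: {impact(*scores):.1f}")
```

Opening the data raises the score, but because capacity multiplies everything else, the biggest jump comes when ordinary users gain the tools to use it.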

It’s certainly possible to have social impact with data that is not open. I argue that the more open the data (salience and capacity being equal), the greater the social impact is likely to be. Similarly, you can open as much data as you like, but if the data is not salient to real-world problems, or the users lack the capacity to make use of it, it will have little impact. (Local authority data portals, I’m looking at you.)

The future: connected cities and empowered citizens?

We are just at the beginning of the open data journey. Across Wales, for example, there are no local authorities really publishing significant levels of open data. Across England there are a handful of places that you could consider good practice in open data publishing.

The dream of open data is one of a connected city where all of the information about transport, congestion, air quality, social care, taxation, everything you can imagine is available. Applications would consume the data and dynamically manage the environment as a result.

This is such a different way to think about public services that it’s worth asking whether our public institutions have the capacity to develop management and governance systems that can in fact cope.

There isn’t a consensus that this data should be open either. There is a real possibility of a dystopian future where large tech companies consume the data about our lives and lock it away from us and from our public institutions.

And will citizens have the capacity, skills and tools to make effective use of this sea of data?

We need many more projects that help citizens gain the benefits of open data and help public servants and companies recognise the importance of open rather than closed public datasets.

Feel free to stop reading right here. This last bit was aimed directly at the students.

If Open Data is your thing don’t forget that we organise Jelly co-working days for data folk in Cardiff (and now in other parts of Wales too).

Also maybe you’ll be interested in this workshop looking at Ofcom’s open data with our good friends from ODI-Leeds.

A note for data science students…

You are studying data science so it seems like a reasonable question to ask.

What is the link between open data and data science?

One way to think about this is that open data is just another sort of data. As a data scientist you look for insights in the data. You don’t really care where the data comes from.

But I’m going to ask you to think seriously about open data in your careers.

There’s a lot of it about. Very often you will find that it adds value. So the first thing I’m going to ask you to get into the habit of doing is to always ask yourself if there is data out there that could be adding more value.

Open data isn’t just for consumption. Once you’ve got your insights the data is a waste product (sometimes called the data exhaust). So the next thing I’m going to ask you to do is to get into the habit of releasing your data where you can. This will require you to think about the licensing restrictions on the data that you use. Licensing is boring but I urge you to embrace it.

To maximise the promise of open data, citizens need help. You could help them. For example, the Open Prescribing service consumes NHS open data and provides a simple point-and-click interface that allows non-technical folk to understand and compare prescribing practices. This is hidden data science and it’s data science for good. And while we’re on the subject, please get involved with DataKind: a fantastic charity that encourages data scientists to volunteer with charities.

As data scientists I hope you will see yourselves as data professionals with a role in society that goes beyond your day job to help all of us understand and make use of data for the good of us all.