Mike Thacker
Published in Porism’s Blog
7 min read · May 25, 2018

What’s needed to discover, link and combine open datasets

I’ve been banging on for so long about how we need better ways of finding and combining datasets, with little obvious impact, that I was beginning to lose conviction. But slowly the message is coming home.

We need to improve how we can discover, link and combine related data from multiple sources if we’re going to properly exploit open data.

Whether you’re a data scientist trying to establish the link between licensing hours and antisocial behaviour, a pressure group that wants to alert people to planning proposals or an innovator who wants to enhance a web presence with localised information, you have to take data from hundreds of sources to build something suitable that works across the country.

Advances in computer science have changed the landscape a little, but we still lack basic information identifying datasets, their content and their quality. We need to deploy common patterns to enable combining and linking datasets.

The background

An Open Data Institute (ODI) managed review of standards led to this comprehensive Open Standards for Data Manual and a more technical guide to Creating and maintaining open standards by my company, Porism, which does a lot of work on open data sharing with the Local Government Association (LGA).

Most of us who were involved in the work feel there’s more to do on the discovery of datasets across the web, and the paper on Characterising Dataset Search Queries by the ODI and University of Southampton highlights some fundamental differences in searching for data rather than documents: queries are longer, more structured and often have a geographic or temporal element.

Pure discovery of datasets is not enough; we need to combine ones on the same topic to get a fuller picture (eg combine data for each local authority in England to build a portrayal of the whole country). We also need to know the common data items between datasets so we can link them and reveal correlations.

The problem

Publishers whose job is to serve just their own organisations or the electorates of just their own governments have little motivation to publish their data in a similar way to others. Some UK public data comes from just one central government department and so is simple to consume. But data from local government is very different. As Phil Rumens’ A rough guide to central vs local government digital makes clear: there are lots of governments each delivering hundreds of services; money is really tight; lots of non-experts need convincing.

There’s minimal consistency in the format of data coming from England’s 353 local authorities. It’s hard to put the case for a standardised approach, particularly to early adopters, purely on the grounds of local self-interest.

A few examples

Crystal Palace neighbourhood forum complains that it has to draw air quality data from five different council websites to get the full picture on its pollution, and many of those sites change how they report each year.

There’s no consistent picture of planning data across UK councils (although an LGA-defined structure is followed by some), and government pays for a privately licensed view that makes consistent the data councils publish openly in a myriad of formats and structures.

Local council election data can’t be analysed for the whole country as results are announced. The only legal obligation for councils is to post a paper notice on the town hall notice board.

Amongst this mayhem, it’s a brave innovator who tries to do something creative that relies on open data published by many local councils.

The problems of aggregating council level data across one country grow to a whole new scale when considering data from smaller neighbourhoods and spanning many countries.

The solution

We need a means of identifying the structure of datasets and to encourage convergence of useful structures.

The Cabinet Office’s one-off Open Data Incentive Scheme showed that, with some co-ordination and assistance, this can be achieved. Some 80 councils publish more than 200 open datasets in pre-agreed formats for planning, licensing and public toilet data.

Linking datasets

The value of open data increases dramatically where diverse datasets can be joined to make new discoveries, such as the link between lead pollution and crime. For that to work, we need to join on common data elements between datasets. Whilst 5-star linked data is ideal for such linkages, simple use of common patterns in data of other formats makes these linkages possible.

Geography is the most common variable for linking datasets, as the world is largely agreed on common measures of longitude and latitude. Other useful common elements, which are less standardised, are: public/elected body; type of service or activity; circumstance/characteristics of a place or person. At least one defined set of values (often many) exists for each of these, but they’re not commonly and consistently used.

We need to identify which standard measure, register or taxonomy is used by each dataset and link directly or by conversion to common values.
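The kind of join this makes possible can be sketched in a few lines. This is a minimal, hypothetical example: the GSS area codes are real ONS identifiers for two London boroughs, but the licensing and antisocial-behaviour figures are invented purely for illustration, echoing the scenario from earlier in the piece.

```python
# Two datasets from different publishers, both keyed on the same
# standard geographic identifier (ONS GSS area codes).
# All figures below are invented for illustration.
licensing = {
    "E09000008": {"area": "Croydon", "licensed_premises": 612},
    "E09000022": {"area": "Lambeth", "licensed_premises": 845},
}
incidents = {
    "E09000008": {"antisocial_behaviour": 4021},
    "E09000022": {"antisocial_behaviour": 5310},
}

# Because both datasets use the same common identifier, linking them
# is a simple join on the shared keys.
combined = {
    code: {**licensing[code], **incidents[code]}
    for code in licensing.keys() & incidents.keys()
}

print(combined["E09000008"])
```

The join only works because both publishers used the same identifier set; if one had used council names and the other GSS codes, a conversion step (the “link by conversion” above) would be needed first.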

Establishing data patterns

Linking and combining datasets relies on establishing common data patterns. They may be patterns for whole datasets (that is, schemas) or patterns for individual data items, such as areas, elected bodies, activities and events. Some datasets conform to documented schemas; more incorporate common patterns, but these are hard to discover.

Something analogous to the Open Knowledge Foundation’s (OKF) Linked Open Vocabularies that shows which data patterns are used by which dataset renditions would help surface useful common patterns and make linkages easier.

How do we get there?

We need to: (1) discover what datasets exist; (2) identify what patterns they use; and (3) promote the patterns that have proved useful in applications and data analysis.

1 Make data discoverable

Publishing lists of standards and datasets that conform to them (and vice-versa) will help discovery.
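The two-way index described here is easy to picture in code. A minimal sketch, with invented standard and dataset names: publish the datasets-per-standard list, and the vice-versa view falls out by inversion.

```python
# Hypothetical published list: which datasets conform to which standard.
# Standard and file names are invented for illustration.
datasets_by_standard = {
    "PlanningApplications": [
        "council_a/planning.csv",
        "council_b/planning.csv",
    ],
    "PublicToilets": ["council_a/toilets.csv"],
}

# Invert it to get the "vice-versa" view: the standards each dataset follows.
standards_by_dataset: dict[str, list[str]] = {}
for standard, datasets in datasets_by_standard.items():
    for dataset in datasets:
        standards_by_dataset.setdefault(dataset, []).append(standard)

print(standards_by_dataset["council_a/toilets.csv"])
```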

Data cataloguing and publishing tools have their own sets of properties to describe each dataset. Some, like Redbridge Council’s DataShare (now open source) and OKF’s CKAN, output an inventory of datasets. The UK Government’s Find Open Data, formerly data.gov.uk, provides a catalogue of UK public sector datasets. It harvests from catalogues in the above (and other) formats, although it is moving away from CKAN.

The Data Catalog Vocabulary (DCAT) provides terms for describing datasets. The Cabinet Office paper Designing URI sets for the UK public sector and the W3C proposed Quality and Granularity Description Vocabulary add properties needed to gain an understanding of a dataset’s quality, which is important in deciding if it’s suitable for a particular situation.
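To make the DCAT idea concrete, here is a minimal dataset description expressed as JSON-LD, built with only the standard library. The dcat:/dct: terms are from the W3C vocabulary; the title, publisher and schema URL are assumptions for illustration (the URL follows the pattern of the LGA’s published schemas, but check the real schema location before relying on it).

```python
import json

# Minimal DCAT-style dataset description as JSON-LD.
# Title, publisher and conformsTo URL are illustrative assumptions.
dataset = {
    "@context": {
        "dcat": "http://www.w3.org/ns/dcat#",
        "dct": "http://purl.org/dc/terms/",
    },
    "@type": "dcat:Dataset",
    "dct:title": "Public toilets",
    "dct:publisher": "Example Borough Council",
    # dct:conformsTo is how DCAT records the pattern/schema a dataset follows.
    "dct:conformsTo": "http://schemas.opendata.esd.org.uk/PublicToilets",
    "dcat:distribution": {
        "@type": "dcat:Distribution",
        "dcat:mediaType": "text/csv",
    },
}

print(json.dumps(dataset, indent=2))
```

The dct:conformsTo property is the hook for the “adopt common patterns” step below: a harvester that finds it can group datasets by schema without guesswork.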

2 Adopt common patterns

The metadata for a dataset can tell us if it conforms to defined patterns/schemas, but we lack common approaches to associating metadata with a dataset. There’s also no widely adopted approach to giving standards identifiers and structured metadata, such as a Uniform Resource Identifier (URI) for each standard, with linked properties and schemas for its renditions in different formats.

Data catalogues following the Inventory standard (like DataShare) include the URL of the schema to which each dataset conforms. There’s a proposed extension to DCAT to say what patterns a dataset conforms to. This isn’t used by CKAN.

Data.gov.uk lets a data publisher pick the schema to which the data conforms from a small fixed set of options and its API lets you extract datasets for a schema (eg datasets conforming to the brownfield land schema). Its new incarnation Find Open Data has removed (at least for the time being) the option to perform a similar search from the user interface.

3 Promote useful patterns

The LGA’s schema page shows schemas to which local datasets conform. These include schemas published by the LGA, the Ministry of Housing, Communities and Local Government and the Government Digital Service. The schemas build on common sets of fields such as:

  • CoordinateReferenceSystem, GeoX and GeoY for location
  • ServiceTypeURI, ServiceTypeLabel for type of local government service

The aggregator combines datasets that conform to the same pattern.
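Once datasets share those field names, aggregation really is this simple. A sketch using the standard library, with two invented council CSV extracts that follow the common fields listed above:

```python
import csv
import io

# Two publishers' extracts using the same agreed field names.
# Councils and coordinates are invented for illustration.
council_a = """CoordinateReferenceSystem,GeoX,GeoY,ServiceTypeLabel
WGS84,-0.09,51.37,Public toilets
"""
council_b = """CoordinateReferenceSystem,GeoX,GeoY,ServiceTypeLabel
WGS84,-0.11,51.46,Public toilets
"""

# Because the schemas match, an aggregator can simply concatenate rows.
combined = []
for published in (council_a, council_b):
    combined.extend(csv.DictReader(io.StringIO(published)))

print(len(combined))            # rows gathered from both publishers
print(combined[0]["GeoX"])      # same field names work across sources
```

Without the shared schema, each source would need bespoke mapping code before the rows could be merged, which is exactly the cost the common patterns remove.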

The URI search page lets publishers find specific Uniform Resource Identifiers from within URI sets used by the schemas. Hence common identifiers are encouraged.

Find Open Data provides a drop-down selection of schemas that can be linked to published datasets.

To some extent we could automate collation of the patterns datasets follow, if we can find the datasets in the first place. Collation would be an inaccurate process where different datasets use different names for the same properties and where common vocabularies for properties, such as DCTerms and SKOS, are not adopted.
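A rough sketch of that automated collation: tally the column names across discovered datasets and treat widely shared fields as candidate common patterns. Dataset and field names are invented; note how the third dataset, which names the same properties differently, silently drops out of the candidate set, illustrating the inaccuracy described above.

```python
from collections import Counter

# Column names observed in discovered datasets (invented examples).
datasets = {
    "council_a.csv": {"GeoX", "GeoY", "ServiceTypeURI", "OpeningHours"},
    "council_b.csv": {"GeoX", "GeoY", "ServiceTypeURI", "Charge"},
    # Same properties, different names: invisible to naive collation.
    "council_c.csv": {"Easting", "Northing", "ServiceType"},
}

# Count how many datasets use each field name.
field_counts = Counter(
    field for fields in datasets.values() for field in fields
)

# Fields shared by two or more datasets suggest a candidate common pattern.
candidates = sorted(f for f, n in field_counts.items() if n >= 2)
print(candidates)
```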

Alongside discovery of dataset patterns, we need to record which data consumers rely on them, so we can establish how useful they are.

Where next?

With cooperation between interested parties in the UK public sector, we could converge on a consistent approach to discovering datasets and useful patterns for them to adopt.

If we were talking standards for physical nuts and bolts, it would be a no-brainer. No manufacturer would develop its own standard, and any builder worth her salt could quote standard sizes and types in a near universal language.

In the world of shared data, people seem hard to convince. There might be a view that Artificial Intelligence will take the problem away. If that’s the case, will someone explain how it’s coming along?

Otherwise, please can we have a plan?


Graduate in Production Engineering and Economics. Director of @Porism and Technical Lead of @LGInformPlus. Open data enthusiast.