Where to Find Linked Open Data for Your Home Projects
Now is the time for home projects, so where can you start?
As exams finish up and people start looking for new ways to spend their time indoors, it would be fantastic if developers got familiar with linked data!
When starting a side-project, I often head to somewhere familiar like Kaggle for a dataset to play with. These datasets are almost all CSV, JSON, or SQL files that developers are already familiar with. To add a new type of data to your arsenal, I will run through some open linked data that you can use today.
If you have looked for linked data in the past then you will almost certainly have seen the Linked Open Data Cloud, which is a touch overwhelming at first:
Topics range across linguistics, geography, biology, government, etc… but where should we even start?
Let’s go for the centre.
Zooming in a little, there is an obvious node that is connected to a huge amount of the other sets - DBpedia.
Most knowledge graphs can be output in multiple linked data formats. If you need a refresher at any point, here is a detailed guide to help you understand them.
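For a flavour of what “multiple formats” means, here is a single (illustrative) fact about Bennachie written in two common serializations, N-Triples and Turtle. The URIs and the elevation value are made up for the example:

```turtle
# N-Triples: one full triple per line, nothing abbreviated.
<http://example.org/Bennachie> <http://example.org/ontology/elevation> "528"^^<http://www.w3.org/2001/XMLSchema#integer> .

# Turtle: prefixes make the exact same triple compact and readable.
@prefix ex:  <http://example.org/> .
@prefix exo: <http://example.org/ontology/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ex:Bennachie exo:elevation "528"^^xsd:integer .
```

Both snippets encode the same subject–predicate–object statement; only the syntax differs, which is why tools can convert between formats freely.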
Introduction to Some Datasets
We all know Wikipedia, the online encyclopedia that is maintained by an open community of volunteers. Wikipedia pages are kept relatively up-to-date and are all linked together (leading to many dives down rabbit-holes) but the information is stored as free-text, some of which is handily structured into infoboxes.
Infoboxes: On Wikipedia, infoboxes are the structured summaries of attribute-value pairs. Usually at the top right of a Wikipedia page.
DBpedia is famously the linked data version of Wikipedia, extracting facts from its pages (mainly these infoboxes) to build a huge knowledge graph of encyclopedic information. For example, if you look at the DBpedia resource page for Bennachie you can see that the coordinates have been extracted and stored as triples:

First released in 2007, DBpedia now contains almost 10 billion triples and is a hub that connects to many of the other linked open data sets. The huge size does come with drawbacks, however. All of this information is extracted automatically through the following architecture:

Wikipedia pages are ingested and parsed, then passed through a range of fact extractors (each tailored to handle a different type of information, like dates or coordinates) before being output as linked data. For the most commonly searched pages (countries, famous people, etc…), this automatic extraction works well. More obscure pages, however, require quite specific extractors, so small inaccuracies creep in.

To illustrate this, let’s look at Aberlour distillery (definitely obscure). On DBpedia, this distillery has 2 stills. On the Wikipedia page, however, it has 2 wash stills and 2 spirit stills, totalling 4 stills (the correct number).

Creating extractors for such specific jobs is understandably not a high priority, as each new extractor handles an ever smaller number of facts. If you would like to contribute, however, you can take a look at their GitHub and docs.
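You can inspect facts like these yourself by querying DBpedia’s public SPARQL endpoint from code. Here is a minimal sketch using only the Python standard library; the query simply lists every property/value pair attached to a resource, and is illustrative rather than canonical:

```python
import json
import urllib.parse
import urllib.request

DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"

def build_query(resource_uri: str) -> str:
    """Build a SPARQL query listing every property/value pair of a resource."""
    return (
        "SELECT ?property ?value WHERE { "
        f"<{resource_uri}> ?property ?value . "
        "} LIMIT 50"
    )

def run_query(endpoint: str, query: str) -> dict:
    """POST a SPARQL query to an endpoint and parse the JSON results."""
    data = urllib.parse.urlencode(
        {"query": query, "format": "application/sparql-results+json"}
    ).encode()
    with urllib.request.urlopen(urllib.request.Request(endpoint, data=data)) as resp:
        return json.load(resp)

# Example usage (requires network access):
# results = run_query(DBPEDIA_ENDPOINT, build_query("http://dbpedia.org/resource/Bennachie"))
# for row in results["results"]["bindings"]:
#     print(row["property"]["value"], "->", row["value"]["value"])
```

The same two helpers work against any SPARQL endpoint that supports JSON results, so you can reuse them for the other datasets below.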
Wikidata is similarly built around Wikipedia but is quite a different project. In fact, DBpedia and Wikidata are heavily interlinked and are often used in conjunction with each other.
As explained, DBpedia extracts information from Wikipedia to generate linked data. Wikidata, however, is a project that aims to create linked data for Wikipedia.
Essentially, Wikidata is populated just like Wikipedia: facts can be added by a community of volunteers. Provenance must be given for any added information, and a rigid structure is specified to keep consistency. Some automatic extractors are used, but this strict creation of linked data minimises inaccuracies and inconsistencies.
This project took off and now has well over 1.1 billion edits and 86 million content pages. This accurate and well-structured knowledge graph can then be used by Wikipedia to populate its infoboxes. If you would like to help populate Wikidata, check out their tutorials here.
On the Wikidata page for Bennachie you can see the edit options and links to references as provenance.
Looking at Aberlour distillery for comparison to DBpedia, there is no mention of the number of stills.
Moving on from encyclopaedic knowledge graphs, mapping comes up in almost any project that involves locations.

Ordnance Survey, Great Britain’s national mapping agency, provides the most accurate and up-to-date geographic data. To help “make public data public”, they released a number of their products as linked data.
To keep with the pattern, here is Bennachie:
To use Ordnance Survey Linked Data, you can query its SPARQL endpoint here: http://data.ordnancesurvey.co.uk/datasets/os-linked-data
The Scottish Government manages statistics.gov.scot to provide public access to the data behind their official statistics in linked open data format. It contains 289 datasets from the Scottish Government itself, the Scottish NHS, the National Records of Scotland, and Transport Scotland. Topics range across health, environment, crime, economy, transport, children, business, housing, etc… so this is a fantastic source of data for keen analysts and data scientists!
Bennachie is not in any of these datasets, but you can still dive into the stats in the browser (and download anything in N-Triples, a linked data format). For example, here is the number of children aged 0–19 in low income families across Scotland:
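Once you have downloaded an N-Triples file, the line-per-triple format is simple enough to explore with plain Python. Below is a hedged sketch of a parser that handles the common cases (URIs, literals with optional datatype or language tag, and blank nodes); the URIs in the sample are invented for illustration, and real-world files are better handled by a proper RDF library such as rdflib:

```python
import re

# Matches the three common term shapes in N-Triples: <uri>,
# "literal" (with optional ^^<datatype> or @lang suffix), and _:blankNode.
# This is a simplification, not a complete N-Triples grammar.
TERM = re.compile(r'<[^>]*>|"(?:[^"\\]|\\.)*"(?:\^\^<[^>]*>|@[\w-]+)?|_:\w+')

def parse_ntriples(text: str):
    """Yield (subject, predicate, object) tuples from N-Triples text."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        terms = TERM.findall(line)
        if len(terms) >= 3:
            yield tuple(terms[:3])

# Illustrative sample in the style of a statistics download (URIs made up):
sample = '''
<http://example.org/area/S12000033> <http://example.org/def/count> "1500"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://example.org/area/S12000033> <http://www.w3.org/2000/01/rdf-schema#label> "Aberdeen City" .
'''

triples = list(parse_ntriples(sample))
```

From here you can filter triples by predicate, group by subject, or load them into pandas for the actual analysis.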
The Scottish Environment Protection Agency (SEPA) has also released a large amount of water-related data as linked open data. You can access statistics about bodies of water, river basins, catchments, etc…
If we look up “Niddry Burn” for example, we can see that SEPA has connected its data to Ordnance Survey postcode districts (top two rows in the ?object column). This is one of the benefits of linked data. By linking to other linked open datasets, information can be analysed across graphs to find interesting connections without replicating any data.
Finally, we can explore this connectedness by looking at a couple of knowledge bases that connect strongly with DBpedia.
The first is Yago, containing more than 120 million facts extracted from Wikipedia. These facts tend to be a lot more specific than base DBpedia’s, adding a detailed taxonomic backbone (especially spatio-temporal dimensions).
Next we have Chaudron, extending over 480,000 DBpedia resources with 950,000 measurements. These are physical measurements such as heights of dams or length of railways.
Unfortunately, Chaudron’s SPARQL endpoint is currently offline. You can still download its linked data dumps from their website, though, load them into any triplestore, and query them there. You can even query across your loaded data and DBpedia using federated SPARQL queries (see query 7 here).
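A federated query joins data matched in your local triplestore with data fetched live from a remote endpoint via the SERVICE keyword. As a hedged sketch (the `ex:height` property is illustrative, not the actual predicate used in the Chaudron dumps; `dbo:abstract` is a real DBpedia property):

```sparql
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX ex:  <http://example.org/measure/>

# ?dam and its height are matched against the locally loaded dump;
# the matching English abstract is fetched live from DBpedia.
SELECT ?dam ?height ?abstract WHERE {
  ?dam ex:height ?height .
  SERVICE <https://dbpedia.org/sparql> {
    ?dam dbo:abstract ?abstract .
    FILTER (lang(?abstract) = "en")
  }
}
```

Because both graphs use the same resource URIs, no data needs to be copied between them for the join to work.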
This is of course not an exhaustive list; the Linked Open Data (LOD) Cloud is huge, so go exploring!
With these datasets you could try some simple analysis. For example, are there population discrepancies between DBpedia and Wikidata? Spoiler: there are. Or maybe you want to look at the connections between education and health with the SIMD data on statistics.gov.scot?
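The population comparison could start with two queries like these, one per knowledge graph. As a hedged sketch: `dbo:populationTotal` and `wdt:P1082` (“population”) are the real property names, and `dbr:Aberdeen` is DBpedia’s resource for the city, but double-check the identifiers for whichever places you compare. The Wikidata query looks the entity up by its English label rather than hard-coding an ID, so it may match several places named Aberdeen:

```python
# Query for DBpedia's endpoint (https://dbpedia.org/sparql).
DBPEDIA_POPULATION = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?pop WHERE { dbr:Aberdeen dbo:populationTotal ?pop . }
"""

# Query for the Wikidata Query Service (https://query.wikidata.org/sparql),
# where wdt: and rdfs: are predefined prefixes.
WIKIDATA_POPULATION = """
SELECT ?pop WHERE {
  ?city rdfs:label "Aberdeen"@en ;
        wdt:P1082 ?pop .
}
"""
```

Run each against its endpoint, then diff the numbers; repeating this over a list of cities gives you the discrepancy analysis.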
Going further, you could try building a chatbot with a linked data knowledge base (to make it more dynamic).
To do any of this though, you will need an introduction to the linked data query language, SPARQL.
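As a small taste, here is what a basic SPARQL query looks like: a SELECT clause naming the variables you want back, and a WHERE clause of triple patterns to match against the graph. This one asks DBpedia for everything it knows about Bennachie (assuming the `dbr:Bennachie` resource shown earlier):

```sparql
PREFIX dbr: <http://dbpedia.org/resource/>

# "Give me every property and value attached to Bennachie."
SELECT ?property ?value WHERE {
  dbr:Bennachie ?property ?value .
}
LIMIT 25
```

Swap the fixed subject for a variable and add more patterns, and you are most of the way to real graph queries.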