Linked Data: what is it? What does it do? Does it do things?
Linked Data has been around for ages. It is composed of a suite of tools, frameworks and standards to work with data in ways that enable its shareability, reusability and machine readability.
The term was coined by Tim Berners-Lee in 2006, but the conceptual model, this way of thinking about data, and the related standards can be traced as far back as 1999.
While Linked Data has not been adopted as widely as it was envisioned (due to a slew of reasons I can only speculate about), standards like RDF are widely used. Initiatives like schema.org promote the machine readability of website contents and allow for the creation of smart search engine results like those Google recipe cards.
So, what’s RDF?
RDF stands for Resource Description Framework, and it is a standard for representation and interchange of data.
Its defining characteristic is that it represents data as statements composed of three parts: subject, predicate and object. Each statement is called a triple.
Another defining characteristic of Linked Data is that all entities are represented by a URI, a Uniform Resource Identifier. This means the same entity can be used by different people for different purposes, but as long as they use the same URI, they will mean the same thing.
Groups of triples form a directed graph structure that describes entities using named links. Now one of the interesting things about RDF is that any component of a triple that is identified by a URI, the predicate included, is itself an entity, so it can be the subject of other triples.
For example in the triple:
dbr:Bob_Marley foaf:knows dbr:Lee_Scratch_Perry .
We are declaring that there is a subject dbr:Bob_Marley connected to an object dbr:Lee_Scratch_Perry through the predicate foaf:knows .
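If you are wondering where the URIs went, dbr: and foaf: are just shorthand prefixes that expand into full URIs. Here is a minimal sketch of that expansion in plain Python. The dbr: binding is assumed to be the usual DBpedia resource namespace (this article never declares it), and the expand helper is a made-up name for illustration, not part of any RDF library.

```python
# Prefix-to-namespace table. foaf: matches the namespace quoted later in the
# article; dbr: is assumed to be the conventional DBpedia resource namespace.
PREFIXES = {
    "dbr": "http://dbpedia.org/resource/",
    "foaf": "http://xmlns.com/foaf/0.1/",
}


def expand(prefixed_name: str) -> str:
    """Expand a prefixed name such as 'foaf:knows' into a full URI."""
    prefix, _, local = prefixed_name.partition(":")
    return PREFIXES[prefix] + local


# The triple from the example, as a (subject, predicate, object) tuple.
triple = ("dbr:Bob_Marley", "foaf:knows", "dbr:Lee_Scratch_Perry")
full_triple = tuple(expand(term) for term in triple)

for term in full_triple:
    print(term)
```

Anyone who uses those same three URIs is talking about the same Bob Marley, the same Lee Scratch Perry and the same notion of knowing someone.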
Much like most humans, both Bob Marley and Lee Scratch Perry have names, so we can describe them:
dbr:Bob_Marley rdfs:label "Bob Marley" .
dbr:Lee_Scratch_Perry rdfs:label "Lee ‘Scratch’ Perry" .
But what’s not so obvious is that foaf:knows can also have properties of its own:
foaf:knows rdfs:label "knows" ;
    rdfs:comment "A person known by this person (indicating some level of reciprocated interaction between the parties)." ;
    rdfs:domain foaf:Person ;
    rdfs:range foaf:Person ;
    rdfs:isDefinedBy <http://xmlns.com/foaf/0.1/> .
The syntax in the examples above is called Turtle and can be explored further in this article published by our resident research nerd, Angus Addlesee.
Without going into too much depth on the FOAF (Friend of a Friend) Ontology, the example above explains the foaf:knows predicate by giving it a human readable name (label) and a human readable description (comment). rdfs:isDefinedBy indicates that it is described in the FOAF ontology, and finally rdfs:domain and rdfs:range explain that foaf:knows expresses a relationship between two foaf:Persons (yes, foaf:People isn’t a thing, we’ve already gone over that in my last article, get over it).
This is the data itself explaining what it is, not a separate dataset in a different format that explains the schema. Linked data can describe itself.
If you want to know more about RDF, Angus had a look at it in the context of Big Data.
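To make the "data describing itself" idea concrete, here is a rough sketch that holds some of the triples above as plain Python tuples and asks which predicates are also described inside the same data. It is a toy model, not how a triplestore works: the match helper and the GRAPH name are invented for illustration, and literals are kept as ordinary strings rather than typed RDF literals.

```python
from typing import Iterator, Optional, Set, Tuple

Triple = Tuple[str, str, str]

# Some of the article's statements as (subject, predicate, object) tuples.
# Simplification: literals such as "Bob Marley" are plain strings here.
GRAPH: Set[Triple] = {
    ("dbr:Bob_Marley", "foaf:knows", "dbr:Lee_Scratch_Perry"),
    ("dbr:Bob_Marley", "rdfs:label", "Bob Marley"),
    ("foaf:knows", "rdfs:label", "knows"),
    ("foaf:knows", "rdfs:domain", "foaf:Person"),
    ("foaf:knows", "rdfs:range", "foaf:Person"),
}


def match(
    graph: Set[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield every triple that fits the pattern; None acts as a wildcard."""
    for triple in graph:
        if all(want is None or want == got for want, got in zip((s, p, o), triple)):
            yield triple


# Every predicate used anywhere in the data...
predicates = {p for _, p, _ in GRAPH}
# ...that also appears as the subject of at least one triple.
described = sorted(p for p in predicates if any(match(GRAPH, s=p)))
print(described)  # ['foaf:knows']

# The schema is just more triples, so it is queried the same way as the data.
domain = next(match(GRAPH, s="foaf:knows", p="rdfs:domain"))[2]
print(domain)  # foaf:Person
```

The point of the sketch is that the schema lookup and the data lookup go through the same function, because both live in the same set of triples.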
Why should I care?
Linked Data has been used by major software companies for a number of years now (see the Google example above), but its use is not widespread, and a lot of companies still do not say publicly that they use linked data, which slows its adoption.
But things are changing. Amazon just released a triplestore, Neptune (which they call a graph database, but c’mon Amazon, it’s a triplestore…), and eBay is also working on a triplestore, Beam (which they’re calling a knowledge graph store…). Forbes is writing articles about RDF (semantic graphs… there’s a pattern emerging here, why is everyone afraid of Linked Data?).
And there’s been a lot of talk about a proposed way of building web applications using linked data called SOLID. But more on that later.
People are realising that a lot of the challenges that we are facing nowadays can be tackled by using linked data.
Companies like Google are popularising the idea of the One Knowledge Graph to rule them all™ and companies like Facebook are showing the potential of looking at your data as a graph rather than rows in a table.
Linked Data can be either or both. On top of that, there’s a standard that comes with it, which means your data is not vendor-locked. There’s a very expressive query language (SPARQL) that has existed for almost 15 years now. And the average machine is now powerful enough to cope with the inherent processing requirements of routine graph operations.
Shifting smarts into data
Data is not meant to be intelligent. Data is meant to be just that, data. Algorithms are meant to be smart.
But what if data were smarter? What if some of the responsibility was shifted into the data? Surely we could have simpler algorithms and simpler applications doing the smart things.
We could have the same algorithms doing smarter things!
Linked Data comes with another accompanying standard, OWL (Web Ontology Language), which is a set of languages for defining ontologies. What are ontologies? Essentially, a formal way of defining entities in a graph and explaining how they relate to each other.
When you define your data in this way, you open a world of possibilities. By defining how data should look and how it should relate to other data, we get access to things such as inference, which allows us to draw conclusions from the data without having to explicitly define them.
“What?” — Everyone
Let’s look at an example:
dbr:Alex dbo:hasChild dbr:Jackie .
dbr:Alex dbo:hasChild dbr:Annete .
We see that Alex has two children, so Jackie and Annete are siblings. This is easy for humans to spot, but hard for machines.
OWL allows us to express that when two people share a parent, they are siblings. So even if our data doesn’t explicitly say that, inference lets us know that the following triples exist implicitly:
dbr:Annete dbo:hasSibling dbr:Jackie .
dbr:Jackie dbo:hasSibling dbr:Annete .
So we don’t need to explicitly declare in the data that Jackie and Annete are siblings to use that information when querying the data.
“That’s cool!” — Everyone
I know right?
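For the curious, the sibling example above can be imitated in a few lines of plain Python. This is only a hand-rolled sketch of the idea: the rule is hard-coded in a hypothetical infer_siblings function, whereas with OWL the rule would live in the ontology and a reasoner would apply it for you. Prefixes are dropped to keep the tuples short.

```python
from itertools import permutations
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]

# The explicit facts from the example, with prefixes dropped for brevity.
facts: Set[Triple] = {
    ("Alex", "hasChild", "Jackie"),
    ("Alex", "hasChild", "Annete"),
}


def infer_siblings(graph: Set[Triple]) -> Set[Triple]:
    """Return the hasSibling triples implied by children sharing a parent.

    Hard-coded stand-in for the rule "two different children of the same
    parent are siblings"; it is not an OWL reasoner.
    """
    children_by_parent: Dict[str, Set[str]] = {}
    for s, p, o in graph:
        if p == "hasChild":
            children_by_parent.setdefault(s, set()).add(o)

    inferred: Set[Triple] = set()
    for children in children_by_parent.values():
        # permutations yields both (a, b) and (b, a), and never (a, a).
        for a, b in permutations(sorted(children), 2):
            inferred.add((a, "hasSibling", b))
    return inferred


inferred = infer_siblings(facts)
for triple in sorted(inferred):
    print(triple)
```

Neither hasSibling triple was ever written down in facts, yet both come out the other end, which is the whole trick.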
How is this used then?
Linked data can be used just like any other type of data storage. There is a whole ecosystem of triplestores, libraries and frameworks to work with (Apache Jena is the real MVP), and new ones are being created.
So to start building a linked data application you would start like any other application. You still have to design your data model (this will hopefully be a future article), define your business rules, and build the application. But knowing that your data has this capability means that there will be fewer things to worry about in your application.
It also means that your logic becomes reusable across applications when it’s shifted to the database.
You might feel tempted to think about coupling, and how your logic should be in your programme, not the database, because it makes the programme dependent on the database. That is true.
It is also true that because it is all standards-based and open, coupling with the database is not the same thing as being vendor-locked. You can switch triplestores at any time.
Can you switch back to relational or graph? Probably not, but it’s not easy to switch between relational and graph either; paradigm shifts always come with a lot of pain. Realistically, this is not a problem with triplestores; rather, it is a problem with the lack of standards, which means the data is not interoperable.
And we’re all about the real.