From SQL Tables To Sparql Graphs

Published in

The Cagle Report

11 min readFeb 19, 2017

When describing semantic technologies to developers, one area that causes immediate confusion is what benefit RDF has over SQL data. Both do joins, after all, both are “relational”. Why spend time learning another query language when a programmer has already mastered SQL databases?

From RDBMS to RDF

Before beginning, it’s worth clarifying some terms. A relational database management system (RDBMS) typically allows for multiple databases, each of which has multiple tables with field columns and data rows. The language used to access the data in these tables is called SQL (structured query language), and had been around in one form or another since Ted Codd first implemented the foundations of this in the late 1970s for IBM (later codified into SQL by Larry Ellison and Oracle).

A semantic database system also allows for multiple databases, each of which are called “graphs”. Each graph has a potentially huge number of assertions, statements consisting of a subject, predicate and object. If you had a database Table name Employee:

* Employee
*+---------+--------------------+------------+
*| EmpId   |  EmpName           |  DeptId    |
*+---------+--------------------+------------+
*|  101    |  Jane Doe          |  17        |
*+---------+--------------------+------------+
*|  102    |  Graham Kerr       |  40        |
*+---------+--------------------+------------+
*|  103    |  Sydney Green      |  17        |
*+---------+--------------------+------------+
*|  104    |  Jason Morgan      |  40        |
*+---------+--------------------+------------+

then an assertion that employee 101 has the name Jane Doe would be given as a “triple” that looks something like:

Employee:101 Employee:EmpName "Jane Doe"

where Employee: here is the table name. Thus the whole table would be represented as four such triples:

Employee:101 Employee:EmpName "Jane Doe".
Employee:101 Employee:DeptId Department:17.
Employee:102 Employee:EmpName "Graham Kerr".
Employee:102 Employee:DeptId Department:40.
Employee:103 Employee:EmpName "Sydney Green".
Employee:103 Employee:DeptId Department:17.
Employee:103 Employee:EmpName "Jason Morgan".
Employee:103 Employee:DeptId Department:40.

Here, Department:17 is a foreign key reference to a row with key 17 in the Department table.

* Department
*+---------+--------------------+-------------+
*| DeptId  |  DeptName          |  DeptMngrId |
*+---------+--------------------+-------------+
*|  17     |  IT                |  103        |
*+---------+--------------------+-------------+
*|  40     |  Accounting        |  104        |
*+---------+--------------------+-------------+

This can similarly be decomposed into triples as:

Department:17  Department:DeptName    "IT".
Department:17  Department:DeptMngrId  Employee:103.
Department:40  Department:DeptName    "Accounting".
Department:40  Department:DeptMngrId  Employee:104.

In other words, you can think of a triple as being a fully decomposed database row field value. When decomposed this way, the triples collectively make up a language called the Resource Description Framework (RDF). This is really ALL that RDF is at its core — a way of representing database triples. Well, almost all. There is the matter of schema.

Normally schema is used to set up a table in a database. For example the above employee table may be set up as:

CREATE SCHEMA Company;  
  
CREATE TABLE Company.Employee   
(EmpID int NOT NULL,  
EmpName char(20) NOT NULL,
DeptID int NOT NULL
)    
CREATE TABLE Company.Department   
(DeptId int NOT NULL,  
DeptName char(20) NOT NULL,
DeptMngrId int NOT NULL
)

In RDF, on the other hand, the schema itself can also be represented as triples:

graph:Company a class:Graph.
graph:Company {# Employee is a class
    Company:Employee     a owl:Class.# Department is a class
    Company:Department     a owl:Class.# EmpName is a property attached to employee and expects a string.    Employee:EmpName     a owl:Property.
    Employee:EmpName     rdfs:domain Company:Employee.
    Employee:EmpName     rdfs:range xsd:string.# DebtId is a property attached to employee and expects a link to a 
# department identfier.    Employee:DeptId      a owl:Property.
    Employee:DeptId      rdfs:domain Company:Employee.
    Employee:DeptId      rdfs:range Company:Department.# DeptName is a property attached to department and expects a string.    Department:DeptName    a owl:Property.
    Department:DeptName    rdfs:domain Company:Department.
    Department:DeptName    rdfs:range xsd:string.
    Department:DeptName    # DebtMngrId is a property attached to employee and expects a link to a 
# department identfier.    Department:DeptMngrId    a owl:Property.
    Department:DeptMngrId    rdfs:domain Company:Department.
    Department:DeptMngrId    rdfs:range Company:Employee.}

The “graph” in this case is the analog of the “Company” schema. Two classes are defined, Employee and Department, and for each two properties are defined. The domain makes explicit what the type of the subject is (what table the rows are from) while the range indicates either the data type of the value or the table of the associated foreign key’s primary key.

The final piece that ties this information together is then an assertion that identifies a given row with its table:

Employee:101 a Company:Employee.
Employee:102 a Company:Employee.
Employee:103 a Company:Employee.
Employee:104 a Company:Employee.
Department:17   a   Company:Department.
Department:40   a   Company:Department.

The whole thing, given collectively, completely describes the database:

graph:Company a class:Graph.
graph:Company {# Employee is a class
    Company:Employee     a owl:Class.# Department is a class
    Company:Department     a owl:Class.# EmpName is a property attached to employee and expects a string.    Employee:EmpName     a owl:Property.
    Employee:EmpName     rdfs:domain Company:Employee.
    Employee:EmpName     rdfs:range xsd:string.# DebtId is a property attached to employee and expects a link to a 
# department identfier.    Employee:DeptId      a owl:Property.
    Employee:DeptId      rdfs:domain Company:Employee.
    Employee:DeptId      rdfs:range Company:Department.# DeptName is a property attached to department and expects a string.    Department:DeptName    a owl:Property.
    Department:DeptName    rdfs:domain Company:Department.
    Department:DeptName    rdfs:range xsd:string.# DebtMngrId is a property attached to employee and expects a link to a 
# department identfier.    Department:DeptMngrId    a owl:Property.
    Department:DeptMngrId    rdfs:domain Company:Department.
    Department:DeptMngrId    rdfs:range Company:Employee.# Identify Instances
    Employee:101 a Company:Employee.
    Employee:102 a Company:Employee.
    Employee:103 a Company:Employee.
    Employee:104 a Company:Employee.    Department:17   a   Company:Department.
    Department:40   a   Company:Department.#Identify Properties    Employee:101 Employee:EmpName "Jane Doe".
    Employee:101 Employee:DeptId Department:17.
    
    Employee:102 Employee:EmpName "Graham Kerr".
    Employee:102 Employee:DeptId Department:40.
    
    Employee:103 Employee:EmpName "Sydney Green".
    Employee:103 Employee:DeptId Department:17.    Employee:104 Employee:EmpName "Jason Morgan".
    Employee:104 Employee:DeptId Department:40.    Department:17  Department:DeptName    "IT".
    Department:17  Department:DeptMngrId  Employee:103.
    
    Department:40  Department:DeptName    "Accounting".
    Department:40  Department:DeptMngrId  Employee:104.}

By itself, this is actually a pretty huge thing. This small file by itself describes the whole database, with a couple of exceptions. I’m deliberately handwaving the maxLength property — it is possible to define constraints in RDF, but there are actually several different ways that this can be done, from OWL restrictions to SHACL, that are not worth getting into here (though see my article (SHACL article)). This means that if you get this file from one RDF database and pass it to another RDF database by a different vendor, it will recreate the database exactly, something that is surprisingly difficult to do in the SQL relational world.

SQL to SPARQL

To query an RDBMS, SQL is the near universal language, although it suffers significantly from different vendors having code that varies just enough to make migration from one system to another exciting. By abstracting away the database, SPARQL simplies that particular problem, though there are still some variations on extension functions that need to be factored in.

SPARQL is short-hand for the recursive SPARQL RDF Query Language, and it is what SQL would have been had it come of age in the Internet. Structurally, it employs JOINS, just as SQL does, but how it expressed those joins is significantly different. For instance, suppose that you wanted to retrieve for each employee the name o that employee and the name of the department that he or she works in. In SQL this query is relatively simple:

select EmpName,DeptName from Company.Employee, Company.Department where
     Employee.DeptId = Department.DeptId

A SPARQL Query looks superficially similar:

select ?empName ?deptName from graph:Company where {
     ?employee a Company:Employee.
     ?department a Company:Department.
     ?employee Employee:DeptId ?department.
     ?employee Employee:EmpName ?empName.
     ?department Department:DeptName ?deptName.
     }

What’s different here is that each line in the query is a pattern to be matched from the databases. If there is a triple that follows that pattern in the database, then it gets added to a consideration set. If all of the patterns are met then the variables that represent values that are met are compared with the selection list and the items in that list are returned. The result of this is a table. For the above, this would look something like

* Results
*+------------------+------------------+
*|  empName         |    deptName      |
*+------------------+------------------+
*| "Jane Doe"       |      "IT"        |
*+------------------+------------------+
*| "Graham Kerr"    |   "Accounting"   |
*+------------------+------------------+
*| "Sydney Green"   |      "IT"        |
*+------------------+------------------+
*| "Jason Morgan"   |   "Accounting"   |
*+------------------+------------------+

In practice, the result is usually returned as an array of JSON objects instead:

[
    {"empName":"Jane Doe",
     "deptName":"IT"
    },
    {"empName":"Graham Kerr",
     "deptName":"Accounting"
    },
    {"empName":"Sydney Green",
     "deptName":"IT"
    },
    {"empName":"Jason Morgan",
     "deptName":"Accounting"
    },
]

Another way of thinking about all of the lines in the query matching is that a path was created in the graph that included all of the items. For “Jane Doe” as an example, the path looks something like this:

select (c)"Jane Doe" (d)"IT" from graph:Company where {
     (a)Employee:101 a Company:Employee.
     (b)Department:17 a Company:Department.
     (a)Employee:101 Employee:DeptId (b)Department:17.
     (a)Employee:101 Employee:EmpName (c)"Jane Doe".
     (b)Department:17 Department:DeptName (d)"IT.
     }

This means that if you put each (a) on the left side with another (a) on the right, (b) on the left with (b) on the right and so forth, you’ll end up creating a graph where every node (subject or object) is connected, as shown below:

This connectedness is why these result sets are known as “graphs”.

Inferencing and OOP

So far, making the jump from relational to graph data seems like it might offer some advantages for portability of data, but the queries are (at first glance) more complex. Where semantics begins to shine is in the arena of inheritance and inferencing.

For instance, consider the graph above. There are two classes that are specified — Company:Employee and Company:Department. It is possible to think of these as classes of Things within the Company graph (or namespace), i.e., Company:Thing. Within RDF, it’s possible to identify a subclass relationship, using rdf:subClassOf predicate:

Company:Employee rdf:subClassOf Company:Thing.
Company:Department rdf:subClassOf Company:Thing.

By itself, this doesn’t necessarily do much of anything. However, it’s worth noting that both an employee and a company have names. You could thus create a new property called “name” that is something intrinsic to any “thing” within a company:

Thing:name a  owl:Property.
Thing:name rdfs:domain Company:Thing.
Thing:name rdfs:range xsd:string.

Now, one of the other things that differentiates a Triple Store and RDF from a relational database is the idea of controlled inferencing. There are certain relationships that are intrinsic to RDF databases — such as subClassOf relationships. That this means is that, if a class is a subclass of another class, then the first class inherits all of the properties of the second. This means that I can replace Employee:EmpName and Department:DeptName with Thing:name

Employee:101 Thing:name "Jane Doe".
Department:17 Thing:name "IT".

Note that “Thing” does not exist as a separate table — it is, in effect, an abstract table.

Now this requires that you add a Thing:name property, so end up with some redundancy. However an alternative (and something that is very useful when dealing with data from relational databases) is to identify predicates as being subProperties of other properties.

At this stage, there is no equivalency between Thing:name and Employee:EmpName, but, you can use the “rdfs:subPropertyOf” relationship, which does for properties what “rdfs:subClassOf” does for classes:

Employee:EmpName rdfs:subPropertyOf Thing:name.
Department:DeptName rdfs:subPropertyOf Thing:name.

This means that, if the rules for inferencing are set up, you can query your graph as follows:

select ?empId from graph:Company where {
    ?empId Thing:name "Jane Doe"
}=>  [{empId: <Employee:101>}]

Where the <> brackets indicate that this is a pointer or semantic identifier. What this means in practice is that if you have a large number of properties that do much the same thing across different classes then you can use a single predicate for all of them. The skos:prefLabel predicate is often used for this purpose, as is rdfs:label.

This is also very useful when you are working within information across ontologies (such as a sales database as opposed to a company database). The owl:sameAs property can be used to indicate when two different resources point to the same person or thing.

Employee:101 owl:sameAs Salesperson:JaneDoe.
Employee:101 Thing:name "Jane Doe".
Salesperson:JaneDoe Sales:2017Revenue "20000000"^^currency:USD.

This means that a SPARQL query looking for data from one kind of record can also reach data from other records where the sameAs relationship is asserted:

select ?name ?2017Revenue from Graph:Company where {
     ?emp Sales:2017Revenue ?2017Revenue.
     ?emp Thing:name ?name.
}=> [{name:"Jane Doe",2017Revenue:{value:"20000000",datatype:<currency:USD>}}]

This gets to the heart of one of the most complex problems facing relational databases — most SQL databases are closed — they know only about their internal keys, but not about external keys. Thus they require complex master data management (MDM) solutions to try to keep everything in sync.

In RDF and SPARQL, MDM is more or less out of the box, because of owl:sameAs and similar inferencing. This makes semantic solutions much more “enterprise ready”, especially given that many SPARQL databases emerged out of government research work where security was already an implicit concern. Graphs have the ability to maintain their own permissions, and as such queries will return only those subgraphs that the user has permission to see. This can make for literally property specific security, something that is generally difficult to achieve in most databases without radically changing the way that SQL works.

Graph Databases vs NoSQL

Graph databases have a tendency to get chunked into the NoSQL Database category. While this is technically true (a SPARQL database, especially, is not a SQL database) it should be noted that an RDF SPARQL database can be configured to handle handle JSON or XML input or output.

On the XML side, SPARQL has an XML based language format called RDF-XML. Typically, in an XML pipeline, an inbound XML document will usually be paired with an XSD schema. This schema in turn can be used to generate an XSLT that will map the XML into the RDFXML format, which can in turn be ingested and saved as triples (possibly with the XML itself being retained for fast retrieval). This makes it possible to both get rapid retrieval of content and to search that content internally as SPARQL.

A similar mechanism is used with inbound JSON content, which can both be deconstructed as a reference-able tree and stored as a string representation. The exact construction of such a tree internally as beyond the scope of this article, but it can be done.

Outbound, most triple stores and graph stores can generate output in a wide number of formats from different forms of XML and HTML to JSON, CSV and even binary formats. Turtle and TriG (an extension of turtle, the notation used here) is also becoming increasingly common as an interchange format between different triple stores.

One final (and important) note. In 2013, the W3C released the 1.1 Sparql Update specification, which makes it possible to write scripts for updating triple content in a standardized way. Up until then, this capability was largely defined by the vendor. Most contemporary triple stores now support this, making it possible to make triple store to triple store ETL far simpler.

Wrap-Up

SQL-based RDBM systems are still a good solution for many, many problems. However, increasingly RDF-based SPARQL solutions can provide the same capabilities as SQL, are better at managing Internet based communication, have the ability to use inferencing to identify and integrate common properties across data systems, and can frequently work better than RDBMS’s in those situations where you need to share data securely and compellingly across enterprises. Every CIO and CDO owes it to themselves and their companies to investigate how graph databases can solve their broader data problems.

Kurt Cagle is an author and information architect specializing in NoSQL and Graph data systems.