Building Semantic Data Catalogs

I had a recent conversation with a senior data manager for a large corporate enterprise who is in the process of setting a long-term data strategy. We had been talking semantics and RDF, and he raised an issue that I’ve run into before as a consultant, one that runs something like this:

“We have a huge amount of siloed data, and would prefer to not have to move or duplicate that data but still need to get at it.”

This is a reasonable concern. Databases represent a major sunk cost for any company, especially global ones, and any solution that requires relocating data (and potentially impacting thousands of existing applications) is a non-starter. At the same time, most data systems, whether relational databases, NoSQL databases, spreadsheets, documents or similar content, are not in fact designed out of the box to be used globally, and there are often complex ETL costs associated with making that data available through a queryable interface.

Putting everything into a triple store is not by itself an ideal solution, any more than dumping everything into a relational database or a NoSQL solution is. The fundamental challenge you face in all of these cases is that the more content you bring into the database, the more of that content needs to be indexed, to the point that even with an efficient index, performance degrades significantly. (This is true of any database, by the way, not just triple stores.)

One strategy that is beginning to gain traction is the notion of a semantic data catalog. This is an idea based upon conceptual resources and REST. In effect, every resource of interest to an organization exists as a certain type (such as an employee, a product, a location, and so on). Depending upon the granularity of the model and the size of the organization, there may be hundreds of these types, but there is typically an inheritance structure that creates a general taxonomy of entity types (such as an aircraft being a subclass of a product, which is in turn a subclass of an entity).
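
Expressed in RDF, such a taxonomy might look like the following sketch (the class: namespace and the class names here are illustrative rather than drawn from any particular model):

@prefix class: <https://www.example.com/meta/class#>.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>.

# an aircraft is a product, which is in turn an entity
class:_Aircraft rdfs:subClassOf class:_Product.
class:_Product rdfs:subClassOf class:_Entity.
class:_Employee rdfs:subClassOf class:_Entity.
class:_Location rdfs:subClassOf class:_Entity.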

A typical relational database is not set up with entities in mind, but rather with tables, where each table can represent a list of entity instances (specific aircraft, for instance) and each row contains the specific information about one such aircraft. Recast as an entity, that information looks something like this:

@prefix aircraft: <https://www.example.com/meta/aircraft#>.
# (declarations for the term:, unit:, currency:, org: and xsd: prefixes omitted for brevity)

aircraft:_17
    a aircraft:Class;
    aircraft:name "Boeing 747-8F"@en;
    aircraft:type term:_aircraftType_7469F;
    aircraft:length "250"^^unit:_Length_EngFoot;
    aircraft:wingspan "225"^^unit:_Length_EngFoot;
    aircraft:topspeed "614"^^unit:_Velocity_mph;
    aircraft:unitCost "403.6"^^currency:_USD_millions;
    aircraft:engineType term:_AircraftEngineType_GEnx;
    aircraft:manufacturer org:_Company_Boeing;
    aircraft:firstFlightDate "2010-02-08"^^xsd:date;
    .
# Information courtesy Google

The format is (as those who read my column may know by now) Turtle, or the Terse RDF Triple Language. A real record may be considerably longer and more complex, of course, but this is enough to illustrate the point. It is typical of the type of information you would expect in a triple store. The URI identifier for this particular aircraft instance is in the first line of the record: aircraft:_17. There’s nothing really magical in this identifier; the real magic is that it is globally unique. Through the mechanism of namespaces, the condensed URI form (or CURIE) aircraft:_17 expands to the string:

<https://www.example.com/meta/aircraft#_17>

In a perfect world, all of this information would be contained within one database table. Alas, the world is almost inevitably not perfect, and this kind of information more than likely originated from two or more tables, possibly even different tables from different databases.

A data source could be a relational database, a spreadsheet, an XML document, a CSV file, JSON, or some similar serialization. When pulled into the triple store, each cell could be described by statements such as the following:

# (declarations for the cell:, row:, column:, sheet:, source:, term: and xsd: prefixes omitted for brevity)
cell:_aircraftExcel12345_sheet1_b2
    a cell:Class;
    cell:hasRow row:_aircraftExcel12345_sheet1_2;
    cell:hasColumn column:_aircraftExcel12345_sheet1_AircraftID;
    cell:hasValue "17"^^xsd:integer;
    .
cell:_aircraftExcel12345_sheet1_c2
    a cell:Class;
    cell:hasRow row:_aircraftExcel12345_sheet1_2;
    cell:hasColumn column:_aircraftExcel12345_sheet1_Name;
    cell:hasValue "Boeing 747-8F";
    .
cell:_aircraftExcel12345_sheet1_d2
    a cell:Class;
    cell:hasRow row:_aircraftExcel12345_sheet1_2;
    cell:hasColumn column:_aircraftExcel12345_sheet1_Length;
    cell:hasValue "250"^^xsd:integer;
    .
row:_aircraftExcel12345_sheet1_2
    a row:Class;
    row:hasSheet sheet:_aircraftExcel12345_sheet1;
    .
column:_aircraftExcel12345_sheet1_AircraftID
    a column:Class;
    column:name "Aircraft ID";
    column:type term:_ColumnType_PrimaryKey;
    column:inSheet sheet:_aircraftExcel12345_sheet1;
    .
column:_aircraftExcel12345_sheet1_Name
    a column:Class;
    column:name "Aircraft Name";
    column:inSheet sheet:_aircraftExcel12345_sheet1;
    .
column:_aircraftExcel12345_sheet1_Length
    a column:Class;
    column:name "Length";
    column:inSheet sheet:_aircraftExcel12345_sheet1;
    .
sheet:_aircraftExcel12345_sheet1
    a sheet:Class;
    sheet:name "Summary";
    sheet:inSource source:_aircraftExcel12345;
    .
source:_aircraftExcel12345
    a source:Class;
    source:type term:_SourceType_Spreadsheet;
    source:name "Aircraft Descriptions";
    source:resourceURL "https://www.example.com/src/AircraftDescriptions.xlsx";
    source:extractionDate "2018-05-02T05:17:12"^^xsd:dateTime;
    source:governedBy <mailto:janedoe@myCo.com>;
    .

The structure is nested: there is one source (an Excel workbook), one or more sheets per source, one or more columns per sheet, and one or more cells per column and per row.
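
To see how these pieces fit together, here is a minimal SPARQL sketch (using the same cell:, row:, column:, sheet: and source: vocabulary as above) that reassembles a single spreadsheet row as column name/value pairs:

select ?sourceName ?sheetName ?columnName ?value
where {
    ?cell cell:hasRow row:_aircraftExcel12345_sheet1_2;
        cell:hasColumn ?column;
        cell:hasValue ?value.
    ?column column:name ?columnName;
        column:inSheet ?sheet.
    ?sheet sheet:name ?sheetName;
        sheet:inSource ?source.
    ?source source:name ?sourceName.
    }

Run against the triples above, this yields three bindings, one per cell in row 2 of the Summary sheet.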

As it turns out, the cell values in general are likely to be intermediate artifacts. There are several ways that this data can be used to create a map, not least of which is a SPARQL UPDATE script:

# This has IRI script:_AircraftGenerationScript
# (prefix declarations for aircraft:, entity:, cell:, row:, column:, sheet:, source:, script: and unit: omitted for brevity)
delete {?iri ?pOld ?oOld}
insert {
    ?iri a aircraft:Class;
        aircraft:name ?name;
        aircraft:length ?length;
        entity:source ?row;
        entity:transform ?script.
    }
where {
    ?row a row:Class;
        row:hasSheet ?sheet.
    ?sheet sheet:name ?sheetName;
        sheet:inSource ?source.
    ?source source:name ?sourceName.
    optional {
        ?idColumn column:name "Aircraft ID";
            column:inSheet ?sheet.
        ?idCell cell:hasColumn ?idColumn;
            cell:hasRow ?row;
            cell:hasValue ?id.
        bind(iri(concat(str(aircraft:),"_",str(?id))) as ?iri)
        }
    optional {
        ?nameColumn column:name "Aircraft Name";
            column:inSheet ?sheet.
        ?nameCell cell:hasColumn ?nameColumn;
            cell:hasRow ?row;
            cell:hasValue ?name.
        }
    optional {
        ?lengthColumn column:name "Length";
            column:inSheet ?sheet.
        ?lengthCell cell:hasColumn ?lengthColumn;
            cell:hasRow ?row;
            cell:hasValue ?lengthScalar.
        bind(unit:_Length_EngFoot as ?lengthUnits)
        bind(strdt(str(?lengthScalar),?lengthUnits) as ?length)
        }
    bind(script:_AircraftGenerationScript as ?script)
    optional {?iri ?pOld ?oOld}
    values (?sheetName ?sourceName) {("Summary" "Aircraft Descriptions")}
    }

This creates a shorter version of the definition given previously, but with two additions, the triples:

aircraft:_17 entity:source row:_aircraftExcel12345_sheet1_2;
    entity:transform script:_AircraftGenerationScript.

What this does is link the canonical entry in the database to its source, and indicate how it was transformed (the transform reference could very readily be an IRI for a script object that also carries its own metadata).
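
For instance, the script resource itself might be described along these lines (a hypothetical sketch; the script: properties and values shown are illustrative rather than part of any standard vocabulary):

script:_AircraftGenerationScript
    a script:Class;
    script:name "Aircraft Generation Script";
    script:language "SPARQL Update";
    script:appliesTo source:_aircraftExcel12345;
    # illustrative timestamp for the most recent run of the transform
    script:lastRun "2018-05-02T05:20:00"^^xsd:dateTime;
    .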

Why is this important? Because by capturing the column, sheet, and source information, along with a type indicator of some sort and a map between source and target resource identifiers, you have what amounts to a data catalog. You have a mechanism for importing triples from source files at query time; those triples can be put into an intermediate graph, queried, and cached, and once they become stale, the graph is simply deleted.
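
One way to handle that caching (a sketch only, assuming a hypothetical graph: namespace keyed to the source) is to write the generated triples into a named graph and drop that graph wholesale once it goes stale:

# cache the triples generated from this source in their own named graph...
insert {
    graph graph:_aircraftExcel12345_cache {?s ?p ?o}
    }
where {
    ?s entity:source ?row.
    ?row row:hasSheet/sheet:inSource source:_aircraftExcel12345.
    ?s ?p ?o.
    };
# ...then discard the cache when the source is known to have changed
drop graph graph:_aircraftExcel12345_cache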

One additional benefit derives from this: with the catalog entries, you can effectively pick and choose what information you want to work with, then let the system retrieve the data from the appropriate systems without your needing to know what those source systems are.

It’s worth noting that there is a fair amount of hand-waving here. Some systems allow for inline evaluation of SPARQL scripts, some don’t. What emerges from this, however, is that data catalogs provide a way to perform complex queries across potentially dozens of systems without overwhelming your indexes, making the approach especially useful for organizations that already have a substantial established data footprint.

Data catalogs also resolve another problem: transformations are not always reversible. If, for instance, a transformation creates an attribute whose value depends upon the state of two or more variables, disentangling that logic (which is not purely functional) can be extraordinarily complex if not outright impossible; calculating the average of a set of values and passing that average as the value of an attribute is a case in point. However, because you know the transform and the associated target property, you can recalculate that attribute should something change in the source.
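
As a sketch of that average case (the fleet: resource and properties and the script IRI here are hypothetical), the transform can simply be re-run against the current source cells, even though the resulting average cannot be decomposed back into its inputs:

delete {fleet:_747Family fleet:averageLength ?oldAverage}
insert {
    fleet:_747Family fleet:averageLength ?averageLength;
        entity:transform script:_AverageLengthScript.
    }
where {
    {
        # aggregate over every cell in the Length column of the source sheet
        select (avg(xsd:decimal(str(?lengthScalar))) as ?averageLength)
        where {
            ?cell cell:hasColumn column:_aircraftExcel12345_sheet1_Length;
                cell:hasValue ?lengthScalar.
            }
        }
    optional {fleet:_747Family fleet:averageLength ?oldAverage}
    }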

This also lets you deal with unit conversions. When pulling data from a relational database, it is not at all uncommon to have no idea what units are involved; that is something that often has to be determined manually. In the initial example, the length of the aircraft is given by:

cell:_aircraftExcel12345_sheet1_d2
    a cell:Class;
    cell:hasRow row:_aircraftExcel12345_sheet1_2;
    cell:hasColumn column:_aircraftExcel12345_sheet1_Length;
    cell:hasValue "250"^^xsd:integer;
    .
column:_aircraftExcel12345_sheet1_Length
    a column:Class;
    column:name "Length";
    column:inSheet sheet:_aircraftExcel12345_sheet1;
    .

This has no units associated with it; there is nothing to indicate whether the length is given in meters or in feet. However, in setting up the SPARQL transformation, you discover that the units are in fact feet. This is added in the insert section of the transformation by:

?iri aircraft:length ?length.

with the relevant part of the where clause looking like the following:

optional {
    ?lengthColumn column:name "Length";
        column:inSheet ?sheet.
    ?lengthCell cell:hasColumn ?lengthColumn;
        cell:hasRow ?row;
        cell:hasValue ?lengthScalar.
    bind(unit:_Length_EngFoot as ?lengthUnits)
    bind(strdt(str(?lengthScalar),?lengthUnits) as ?length)
    }

This creates a scalar attribute, which, given the template, will produce the following output:

aircraft:_17 aircraft:length "250"^^unit:_Length_EngFoot.
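
Because the unit now travels with the value as its datatype, a later query can normalize lengths without guessing. A minimal sketch, assuming a unit:_Length_Meter datatype and the standard 0.3048 meters-per-foot factor:

select ?aircraft ?lengthInMeters
where {
    ?aircraft aircraft:length ?length.
    filter(datatype(?length) = unit:_Length_EngFoot)
    # 1 foot = 0.3048 meters
    bind(strdt(str(xsd:decimal(str(?length)) * 0.3048), unit:_Length_Meter) as ?lengthInMeters)
    }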

For more on working with specified units in RDF, see my article Semantic Datatypes for Fun and Profit.

Finally, it is worth noting that the source definition for your document, feed, or database should include enough information to retrieve that content; this is essentially the connection record for establishing a session with the source system. More than likely, at least some of this will occur outside the realm of SPARQL queries and updates, so authentication information shouldn’t be kept within the RDF database, but just about everything else can be stored there.
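
For a relational source, such a record might look like the following sketch (the source IRI and connection-related properties are illustrative; actual credentials would live in whatever secrets store the connecting middleware uses):

source:_fleetOpsDB
    a source:Class;
    source:type term:_SourceType_RelationalDatabase;
    source:name "Fleet Operations Database";
    source:connectionURL "jdbc:postgresql://db.example.com:5432/fleetops";
    source:governedBy <mailto:janedoe@myCo.com>;
    # credentials deliberately omitted; they are resolved outside the triple store
    .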

So, to recap: while it is possible (and in some cases desirable) to bring everything into the database, by using a semantic data catalog approach you can bring in intermediate information, transform it, and cache it as appropriate for subsequent queries. You can also use SPARQL UPDATE to create different views of the data, and it is a relatively small step from writing specialized queries to compositing transformations via a check box of “available” properties for a given resource. Additionally, by taking such an approach, data dimensions can be appended and (not shown here) foreign keys can be resolved, a topic I’ll reserve for another article.

Writer and futurist Kurt Cagle has been blogging and presenting about NoSQL databases, data analytics, semantics and machine learning for a couple of decades now. Look for the hashtag #TheCagleReport for more articles on these and other topics on medium.com.