Cool things I do with RDF

Dean Allemang
Feb 7, 2023 · 15 min read

It’s no secret that my favorite technology for managing FAIR data, and for data sharing in general, is RDF. To start with, RDF is a graph data representation, which makes data merging a lot easier. RDF also has standard string serializations (a few of them tuned to different needs), which makes data publication a lot easier. With RDFS, you can even represent metadata in RDF, making your metadata findable and reusable.

But in my daily work, there are a bunch of cool things that I do with RDF on a regular basis that are pretty unusual, and difficult to do with other data representations, especially tabular things like CSV and Relational Databases. Here’s a list of some of those things:

  1. Take data from one database and load it into another. Because RDF is a standard with several implementations, you can export data from one database and import it into another, with no loss (or unwanted gain!) of information. There are lots of reasons why you might want to do this, ranging from licensing (one implementation is open source, another is proprietary), to scaling (one database might be faster than another on certain problems), to extra features (visualization, publication, query management, etc.), to simply avoiding vendor lock-in (you don’t like how this vendor treats you? Migrate your data to a new one!).
  2. From time to time, someone will tell me that they have some data in RDF, and they ask me to tell them what’s in it. Or nowadays, I’ll find one of the millions of datasets available on data.world or in the Linked Open Data Cloud, and I’ll want to find out about it. RDF is quite good at being self-describing, and SPARQL is a great way to research data in RDF, even if you know nothing at all about it before you start.
  3. This one is a bit more subtle, but is the keystone to a lot of the work I do. I’ll call it layering (meta-)data. I’ll build a model or a dataset that captures some aspect of a system, workflow or business rule. Then I’ll customize it by adding more detail, for a particular situation or customer. This process can go on and on; customizing for a particular customer, then customizing that for particular departments, then customizing again for different projects, keeping them all in alignment through the shared data or model.

I’ll do a deep-dive on each of these in turn.

Transferring data from one triple store to another

One of the beauties of having a data representation standard is that you can have multiple implementations of the same database standard, and migrate your data from one to the other without any loss of information or fidelity. In principle, this is possible with relational databases; after all, the mathematics of the relational algebra lets us prove that the data in one system is the same as the data in another, so we should be able to migrate from, say, Oracle to DB2 to Postgres to MySQL as much as we like, with no loss of data, and with little effort. In one of my earlier jobs, I used to work for someone who had started her career as the captain of one of IBM’s famous data teams, who would make DB2 do whatever you needed it to do. I naively asked her what it would take to migrate a customer’s Oracle deployment to DB2. She giggled and said, “I’ll be happy to prepare a quote for doing that work.” Those teams don’t come cheap, and it takes a while to schedule them.

Pouring data from one RDF triple store into another.

With RDF triple stores (that’s what we call a database for storing, merging and querying RDF data), this is quite a routine thing to do. If someone quotes you a project to do this, and they charge more than a few minutes of their hourly rate, then you are being ripped off. Since it is so easy to do, I find myself doing it just about every day. What are the motivations for doing this? There are a lot of them, applicable in a variety of contexts.

  1. From the customer’s perspective, the motivation for this is to avoid vendor lock-in. If they pay money for a triple store, they’d like to know that they can migrate from one to the other without a large re-investment. A common pattern in this situation is to begin with one of the open-source triple stores, and migrate to an enterprise level or cloud system once the project has proven itself successful.
  2. While the data representation of all the triple stores is the same, it is common for each of them to have some extra functionality that is useful for one thing or another. One of the reasons I like to use different systems is the way they format data. Many systems are good at storing and querying RDF data, but not as good when it comes to formatting the data; if you’re someone like me, who wants from time to time to look at their data in a file, that’s not very useful. Jena is much better for that. So I often download data from one system (data.world is very good at making its data available for other systems to use), and have Jena format it.
  3. Another thing that the various triple stores have different strengths in is data visualization. The RDF standard doesn’t give any advice about how to visualize data, but this hasn’t stopped a number of vendors from providing visualization tools. TopBraid Composer has a great tool for visualizing metadata; if your RDF file contains RDFS metadata (or OWL metadata), you can use TopBraid Composer to view these in a UML-like display format. At data.world, we have a tool called gra.fo for visualizing and editing metadata in RDFS.
    AllegroGraph has a wonderful tool called Gruff that does a great job of visualizing RDF data itself. You can check each node to see what edges come from it, then visualize the ones you choose, and keep going until you have a nice picture of an interesting subset of your graph.
  4. Different triple stores have different performance profiles. Not surprisingly, triple stores targeted for enterprise use can scale to much larger volumes of data than an open-source triple store on a laptop. But even different enterprise level systems have different performance profiles. For example, data.world is very good at managing a large number of distributed datasets (the public site at data.world has half a million data sets; you can query any subset of them you like), whereas other triple stores specialize in indexing large amounts of data in a single data source.
  5. Something I do very often is to find one or more datasets, run some queries over them, merge the output of those queries with more data, run some queries over that, and repeat until I have a new dataset that represents a bespoke analysis of the original datasets. Basically, I want to write a program that runs SPARQL queries in a particular sequence. For testing this sort of thing out, Jena’s command-line tools are great; you can manage your queries and datasets as files, and write your program in bash. For more serious applications, if you like programming in Python, RDFLib is a great system for this sort of thing (see the sketch just below this list). For Java fans, Jena’s Java API is used in all sorts of RDF systems, and has been well tested.
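
To make that last item concrete, here is a minimal sketch of such a query pipeline in RDFLib. The file names and the queries are hypothetical stand-ins; the point is the shape of the pipeline: load, query, build a new graph from the results, query again.

from rdflib import Graph

# Load two datasets into one graph (file names are hypothetical)
g = Graph()
g.parse("source_a.ttl", format="turtle")
g.parse("source_b.ttl", format="turtle")  # parsing again merges into the same graph

# Step 1: a CONSTRUCT query produces a new, smaller graph from the merged data
step1 = g.query("""
    CONSTRUCT { ?s a ?type }
    WHERE     { ?s a ?type }
""")
derived = Graph()
for triple in step1:
    derived.add(triple)

# Step 2: query the derived graph, just as you would the original
for row in derived.query("SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }"):
    print(f"derived graph has {row.n} triples")

# Save the bespoke result as its own dataset
derived.serialize(destination="analysis.ttl", format="turtle")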

The interoperability of RDF triple stores isn’t just something theoretical; it works in practice, literally every day. It is pretty easy to check that the data you loaded into one store is the same as what you had in another (a quick checksum is simply the number of triples; while matching counts don’t guarantee that the two datasets have identical content, it is a very easy measure and is very unlikely to give a false positive).
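
For what it’s worth, here is roughly what that check looks like in RDFLib, assuming you have exported each store’s contents to a file (the file names here are hypothetical):

from rdflib import Graph
from rdflib.compare import isomorphic

# Parse the exports from two different triple stores;
# the serialization formats don't have to match.
g1 = Graph().parse("export_from_store_a.ttl", format="turtle")
g2 = Graph().parse("export_from_store_b.nt", format="nt")

# len() on a Graph is its triple count: the quick checksum
print(len(g1), len(g2))

# A stricter check: graph isomorphism, which handles blank nodes correctly
print(isomorphic(g1, g2))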

Here is what I think is a comprehensive list of triple stores I have used in this way. I won’t guarantee that I have done a round-trip test of every pair of stores in this list, but I have certainly exchanged RDF data between each of these and at least one other. Normally, a list of “tools that I have used” doesn’t count as an endorsement; but in this case, it does. I am willing to endorse every one of these tools as a good choice for managing data in RDF. Just because I left something out here doesn’t mean it isn’t good; it just means that I haven’t (yet) had the opportunity to do a project with it.

  1. data.world. It should come as no surprise that I would list this one first. This is the quickest way out there to get from “hey, I’ve got some RDF data” to writing a query. No downloads, nothing to install, no need for a particular runtime engine. Just log in with your GitHub, Twitter, or Facebook account, upload or link to the data, and start querying.
  2. Jena. This was used as a reference implementation during the RDF working group’s development. Open source, command line and API versions, persistent database as well as in-memory.
  3. RDFLib. Python API, very intuitive and pythonic. Also open source.
  4. AllegroGraph. Enterprise-level triple store with great management of processors and memory.
  5. GraphDB. A solid triple store with a great pedigree.
  6. Cambridge Semantics Anzo. Comprehensive knowledge graph platform with support for database connections and running sequences of queries.

Exploring RDF with no prior knowledge

Every once in a while, someone will point you to a dataset, and ask you what it’s about. More often, you’ll be browsing around for data, and find some. If this is a relational database, you can hope that someone gave you a database schema or a data dictionary, so that you know at least what tables are in it. If they don’t, well, I don’t know what you do. I occasionally hear someone say, “RDF is so hard. I don’t know where to start, unless someone gives me an ontology, and I don’t understand ontologies, so I’m going to give up. It’s just too hard!!”

I have always found this point of view very odd. First off, ontologies aren’t any harder than database schemas, and they even have the advantage that there is a standard language for them. But more importantly, you don’t need to know anything about an ontology to understand RDF data (this goes for any graph data, actually). But how do you start? Let’s give it a try.

An explorer discovering an ancient temple in a jungle
Exploring datasets can be as much fun as exploring the jungle, and a lot less dangerous!

I decided to go through some of the data that I used in Semantic Web for the Working Ontologist, and find a dataset that I had forgotten about. I decided on a dataset, but I’m not going to tell you anything about it, since that’s what this section is about. You can find it among the Working Ontologist data on data.world, where it is called VOCAB_QUDT-DIMENSION-VECTORS-v2.0.ttl. If you click through to that, it will ask you to log in to data.world. You can use your GitHub, Twitter, or Facebook account to do that; it doesn’t cost anything, and data.world won’t send a sales associate after you. Think of it just like creating a GitHub account: you can start browsing around, and you can even create your own content, but you don’t have to pay just for participating in the public data mission.

You could go ahead and look at this data, but let’s hold off on that. Suppose the file is too big to get a handle on, or you just don’t know how to read TTL. How can we find out what is in there?

Each of the following queries is a link to the live query on data.world; if you click through, you can see the query in action.

One of my favorite starting queries for any dataset is this one:

SELECT DISTINCT ?p
WHERE {?s ?p ?o}

What does this do? It matches all the triples in the dataset (?s ?p ?o), then shows just the predicates in those triples. The “DISTINCT” keeps us from seeing the same ones over and over again.

The result includes 21 properties. A large but not unmanageable number. There are some familiar properties in here, namely rdf:type and rdfs:label. How can we use these to our advantage?

First dozen answers to the “distinct properties” query

Let’s start with rdf:type. This property is usually used to map data items to their types (hence the name). So we can query for all the types that are used for data in this set:

SELECT DISTINCT ?type 
WHERE {?s rdf:type ?type}

This time, we get just four answers, so we can show them all:

All four answers to the “distinct types” query.

What do we know from this? First off, there are two familiar classes here from OWL, namely owl:Class and owl:Ontology. This suggests that someone has documented this dataset as an Ontology in OWL, and we could just query for its description from there. But let’s wait on that; suppose this dataset didn’t have such nice landmarks in it. What else have we learned?

There’s a thing called a CatalogEntry, but what looks more interesting to me is this new thing that is probably from the data domain itself: something called a DimensionVector. What do we know about this? Well, we know that this dataset defines a bunch of these things, but how many? And what do we know about them?

SELECT (COUNT (?dv) AS ?many)
WHERE {?dv rdf:type <http://qudt.org/schema/qudt/DimensionVector> }

The result from this query is 194; this dataset defines 194 DimensionVectors. We’re starting to understand it now; we don’t yet know exactly what a DimensionVector is, but we now have 194 of them. That’s too many to look at all at once; let’s see what sorts of data we have about any of them.

SELECT DISTINCT ?p
WHERE {
  ?dv rdf:type <http://qudt.org/schema/qudt/DimensionVector> ;
      ?p ?o .
}

This query is very much like the first one, finding the predicates that are used in the data. But in contrast to the first query, this one is focused only on entities that have type DimensionVector; so we are finding which of our 21 properties are used to describe DimensionVectors.

The answer is pretty terse; just five properties are used for DimensionVectors:

Five properties used for DimensionVectors

So what have we learned so far? This dataset describes a bunch of things called DimensionVectors (194 of them), and each one has a description, a label, and two more things we haven’t heard of before: a BaseQuantityKind and a vectorMagnitude.

We can continue in this fashion; we have a lot of ways we can go from here. We could look at the labels and descriptions of the DimensionVectors, we could see what those BaseQuantityKinds look like, we could investigate those vectorMagnitudes. But we already know a lot; we might decide we know enough, and not pursue the investigation any further.
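
If you’d rather poke at the file locally instead of on data.world, the same exploration works in RDFLib. Here is a sketch that dumps every property and value for a handful of DimensionVectors (the LIMIT is just to keep the output readable); it assumes you have downloaded the Turtle file next to the script.

from rdflib import Graph

g = Graph()
g.parse("VOCAB_QUDT-DIMENSION-VECTORS-v2.0.ttl", format="turtle")

# List every property and value for a few DimensionVectors
q = """
    SELECT ?dv ?p ?o
    WHERE {
      ?dv a <http://qudt.org/schema/qudt/DimensionVector> ;
          ?p ?o .
    }
    LIMIT 25
"""
for dv, p, o in g.query(q):
    print(dv, p, o)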

Back at the second query, I chose not to follow the owl:Ontology lead; I did this on purpose, because it would have made the game too easy. If we investigate it the way we investigated DimensionVector, we pretty quickly find that there is just one Ontology in this dataset, and it has a description that says:

QUDT Dimensional units is a vocabulary that extends QUDT Units with properties that support dimensional analysis. A dimension is a relationship between a quantity system, a quantity kind of that system, and one or more dimension vectors. There is one dimension vector for each of the system’s base quantity kinds. The vector’s magnitude determines the exponent of the base dimension for the referenced quantity kind.

Well, that about sums up this dataset in a nutshell. We had learned about dimension vectors, and we knew that they had magnitudes and quantity kinds. This description gives us the context of those findings in some larger system (in this case, QUDT).

We started from this dataset knowing absolutely nothing about it; no class names, no namespaces, nothing. We didn’t even know which triple store it was running in (it happens to be data.world, which is the best way out there to publish queries in blog entries); these queries would work equally well in any triple store. When we finished, we knew the names of the things that are being described, what type of thing they are, and how many there are. This was all corroborated when we found the documentation of the dataset left by its authors.

Layering Metadata

Layering takes advantage of how easy it is to merge data with RDF. When your data is in a graph, you can just superimpose one graph on another to form a bigger graph. You can see this in the animation below; there are five small graphs that come from different places. There are some concepts that they share; when two graphs come together in the animation, the common concepts show up as a single node. When they all come together, they tell a story that is greater than the sum of the parts.

Graph data from five sources, animated to form one combined graph.

How do I use this capability in an enterprise data management setting? A typical problem in enterprise data management is that different parts of a large firm use different business vocabulary to describe their work. It is tempting to say that the company needs a project that will unite the vocabulary of the entire firm: get everyone to use the same vocabulary. This is an attractive idea, but fundamentally impractical. First off, it is a monumental task to get part of an organization to change how it describes its business. But more importantly, workflows, business forms, indexing methods and many other operations are already in place and working, using some current vocabulary. No part of the business is “wrong” in its use of vocabulary; each of them has a good reason to use vocabulary the way it does.

What is the alternative? Just let different parts of the company use terminology any way they choose, and just cope with the fact that one part of the company can’t communicate effectively with another? This is also fundamentally impractical. So what else can you do?

Here’s where the compositionality of RDF comes into play. You can start with a common vocabulary for the whole enterprise, but have customizations of it for each line of business. The lines share the common vocabulary, but use their own “vocabulary plugin” for their own work. They share what is in common, while keeping their own terminology for their own use.

Two sub-vocabularies being interchanged, while a shared vocabulary stays constant.

This method is useful in a wide variety of contexts. FIBO (the Financial Industry Business Ontology) uses this method to describe jurisdictional variations in its models; the base FIBO model describes content that is shared among jurisdictions (e.g., the fact that business entities register with some government entity), but jurisdictional differences (such as whether that entity is a national or sub-national entity, and what liability protections are allowed by law for different sorts of entities) are represented in separate files.

SKOS (the Simple Knowledge Organization System) encourages the use of this method. SKOS defines a class called skos:Concept, for any vocabulary concept. Users of SKOS are encouraged to extend SKOS by creating their own subclasses of skos:Concept; each vocabulary will do this in a different way (e.g., defining subdomains in their own vocabularies), while adhering to the basic structure of SKOS. A good example of this is the structure of the GACS Agricultural Vocabulary, which uses several subclasses of skos:Concept to provide organizational structure to the vocabulary.
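
As a toy illustration of that pattern (the namespace and class names below are made up for illustration, not taken from GACS), here is how a vocabulary might declare its own subclass of skos:Concept with RDFLib:

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import OWL, RDF, RDFS, SKOS

EX = Namespace("http://example.org/vocab/")  # hypothetical namespace

g = Graph()
g.bind("skos", SKOS)
g.bind("ex", EX)

# A domain-specific subclass of skos:Concept
g.add((EX.CropConcept, RDF.type, OWL.Class))
g.add((EX.CropConcept, RDFS.subClassOf, SKOS.Concept))
g.add((EX.CropConcept, RDFS.label, Literal("Crop concept")))

# Instances of the subclass are still skos:Concepts, so generic SKOS
# tooling keeps working on them
g.add((EX.Wheat, RDF.type, EX.CropConcept))
g.add((EX.Wheat, SKOS.prefLabel, Literal("wheat")))

print(g.serialize(format="turtle"))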

This is a pretty simple trick for a graph data representation like RDF; since merging graphs is easy, common aspects of multiple systems can be managed as a single RDF file, while differing aspects for various stakeholders can be managed as their own RDF files. Simple as it is, it has powerful ramifications for the governance of metadata. Any group of stakeholders can see the full scope of the things they are interested in by simply merging the common core with their own files, as shown in the animations above.
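
Here is a minimal sketch of that merge with RDFLib; the file names are hypothetical, standing in for the shared core and one line of business’s extension:

from rdflib import Graph

# The shared enterprise vocabulary
core = Graph().parse("enterprise_core.ttl", format="turtle")

# One line of business's "vocabulary plugin"
trading = Graph().parse("trading_extension.ttl", format="turtle")

# Merging is just graph union; nodes with the same URI become the same node
combined = core + trading

print(len(core), len(trading), len(combined))
combined.serialize(destination="trading_view.ttl", format="turtle")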

Final remarks

All of the things I describe here are things that the RDF framework supports in theory; you can merge graphs, assign URIs to nodes in graphs, manage multiple configurations of graphs. But just because something works in theory doesn’t mean it will work in practice. Each of the things I describe here (switching databases, deep data discovery, and managing multiple configurations) is something that I actually do in my practice, on almost a daily basis. They really do work, and work reliably. Sometimes I forget how spoiled I am.


Dean Allemang

Mathematician/computer scientist, my passion is sharing data on a massive scale. Author of Semantic Web for the Working Ontologist.