Why I’m not excited about RDF-Star

Dean Allemang
10 min read · Feb 22, 2023

I did a companion Vlog with Ashleigh Faith on this same topic — view it here.

One of the issues that inhibits adoption of RDF is the perception that it is lacking somehow in representational power. Given that RDF is at least as powerful as the formalism behind relational databases, this seems a bit suspicious. But a much fairer assessment would be that RDF is lacking in representational convenience of some sort. Sure, you can do the things you want to do, but it is difficult somehow (and indeed, I have made a similar complaint about tabular representations). What are some of these inconveniences, in the case of RDF?

Examples

Here’s an example. I attended a book club where we were reading Michael Uschold’s excellent book, Demystifying OWL for the Enterprise. In the second chapter, Michael gives an example of a patient visiting a doctor, which he models in great detail. In his example, a patient named John Doe visits Doctor Smith, and during that visit also receives care from Nurse Wilson, who works for Dr. Smith. I will only look at the first level of detail, in which Michael models the visit, for which there are two Care Providers and a Patient. The data about John Doe’s visit, expressed according to Michael’s model, can be summarized like this:

Graph representation of a doctor visit, with two care providers, Nurse Wilson and Dr. Smith, and a single care recipient, John Doe
Data represented according to a summary of the model given in Demystifying OWL.
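
In Turtle, the data in the figure might look something like the sketch below. The node name :Visit1 and the property names :careProvider and :careRecipient are placeholders chosen for illustration, as is the example.org namespace; the vocabulary in the book may well differ.

```turtle
@prefix : <http://example.org/> .

# One node stands for the visit itself; the participants hang off it.
# All names here are illustrative assumptions, not the book's vocabulary.
:Visit1 :careProvider :DrSmith , :NurseWilson ;
        :careRecipient :JohnDoe .
```

That is three triples in total: two care providers and one care recipient, all attached to the same visit.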

Michael goes on to model the time and date when the visit occurred, the names and qualifications of the providers, the tests and procedures that were carried out, and so forth.

One of the book club members made an interesting comment; he asked why Michael hadn’t modeled this in the ‘obvious’ and ‘simple’ way, by saying something like:

:JohnDoe :visited :DrSmith . 

Michael handled the question with grace, but in some sense, the point of his whole book is an answer to this question. Why do we model data one way rather than another?

But let’s take a closer look at this example. Suppose we wanted to add in a test that was taken when John visited Dr. Smith, say, he had his body temperature taken. How would we express that fact, if we started with the “obvious, simple” representation? This is pretty difficult to do; you need a way to talk about the statement “John visited Dr. Smith” to say something like, “and while he was there, he had his body temperature measured.” And we probably want to go on to say, “the value was 98.6°F.” This is not convenient in RDF; if this is how you started out, you’ll find RDF deficient pretty quickly.

Here’s another example. In a very old article, Jesus Barrasa describes various data models, and gives an example of a flight from New York to San Francisco. He represents this connection in RDF as follows:

:NYC :connection :SFO .

Once you start here, then putting in things like the price and the distance flown is difficult; it seems pretty inconvenient to represent this in RDF.

RDF-Star to the rescue?

At this point, RDF enthusiasts are tempted to sort of sheepishly give in, and mumble something indistinct about “RDF-Star will fix all of this”.

So what is RDF-Star? According to the working group charter, it is “[an extension of] RDF and SPARQL related recommendations, with the ability to concisely represent and query statements about statements.” In short, it is a way to include triples whose subject or object are themselves references to RDF triples. It has been available in proposal form for many years, and is on the road to being standardized. So back to the mumbled response; can RDF-Star fix these inconveniences with RDF?

There are two things wrong with this. First off, these examples are not examples of RDF being broken; they are just examples of poor ways to use RDF. Second, even if you don’t fix your RDF modeling practice, RDF-Star isn’t going to make these problems any better. And in fact, if you use RDF-Star to fix these things, you’re just compounding the issues.

Is it really broken?

So let’s start with that first statement, that these things aren’t examples of RDF being broken. Let’s start with Michael’s Doctor Visit example. One reason why the “obvious, simple” representation of John Doe visiting the doctor seems simple is because it only uses one triple. John Doe visited Dr. Smith. Simple! In contrast, Michael’s solution uses two triples; the visit has a care provider (Dr. Smith) and a care recipient (John Doe)¹. One more triple doesn’t seem like a lot, but if you do this for every statement, that’s twice as many triples. And that means twice as much space, twice as much to think about, twice the overhead.

But this apparent disadvantage amortizes pretty quickly. When I add in the participation of Nurse Wilson, Michael adds just one more triple (shown in the figure). The “simple, obvious” solution has to add at least one:

:JohnDoe :visited :NurseWilson . 

But this representation has not yet made the point (which Michael’s solution does) that the two visits, one to Dr. Smith and one to Nurse Wilson, were actually the same visit. I don’t quite know what the “simple, obvious” way to do that is, but it’s going to take at least one triple. Or, if you want to leave triples behind, one cell in a table, or one key-value pair, or one something. Michael’s three triples aren’t looking so extravagant now.

There’s another cool thing about Michael’s solution. I don’t know about you, but when I call the doctor’s office, I make a thing I call an appointment. That appointment is for a visit. And after I leave the office, they send me a note with the subject, “Your recent visit.” A visit to a doctor’s office isn’t some new thing I am inventing just to get around some deficiency in my modeling system; it’s actually the way I talk to my doctor’s office, and the way they officially communicate with me. Furthermore, my visit summary includes an indication of the caregiver(s) (okay, they don’t use that word), and all the tests that were taken, along with their results. The page “Visit Summary” looks a lot like Michael’s model.
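
Once the visit is a node of its own, recording the body temperature reading from earlier is a matter of hanging a few more triples off that node. The following is only a sketch: :includedTest, :BodyTemperatureMeasurement and :valueInFahrenheit are hypothetical names invented for this example, not terms from Michael’s model.

```turtle
@prefix : <http://example.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# The visit, as before (illustrative names).
:Visit1 :careProvider :DrSmith ;
        :careRecipient :JohnDoe ;
        :includedTest :TempReading1 .

# The test is its own thing, so its result has somewhere to live.
:TempReading1 a :BodyTemperatureMeasurement ;
        :valueInFahrenheit "98.6"^^xsd:decimal .
```

Nothing here talks about a statement; every triple describes the visit or the test.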

If at this point you have decided that modeling a visit isn’t such a bad thing to do, you might be wondering how to find notions like “visit”, that appear in our everyday discourse, and know when those are things you should be modeling. If you are wondering this, you’re in luck; Michael’s book (linked above) needs to go on your reading list.

Now let’s have a look at the flight example. I don’t know about you, but I have occasionally spoken to the person next to me on a flight, and found out that they paid a different price for their ticket than I did. So it’s a bit disingenuous to say that the connection from NYC to SFO has a cost. Even Priceline made quite a business years back by letting you make an offer to name your own price for a ticket; the route from NYC to SFO didn’t have a price, it had offers, and offers have prices.

a man and a woman sitting together on an airplane, looking at a laptop and some documents.
I sometimes talk to strangers on planes.

Oh, and read that last paragraph again. I referred to a few concepts that are probably familiar to you; a flight, which goes from one place to another, a ticket, which gives you the right to a seat on that flight, and an offer to purchase that ticket, which (even if you aren’t using Priceline) you have the option to accept or refuse. When you model all this out, it looks a lot like Michael’s example of a Doctor’s Visit. Yes, it uses a lot more triples than just “NYC connection SFO”; but it also matches with the way we normally talk about these things. If we get back to our three things we want to do with data, i.e., publish, find and merge, this makes our data a lot easier to understand, supporting successful publication and reuse of the data. The cost to do this isn’t in the complexity or size of the published data; in fact, as we saw above, the size isn’t much higher, if at all. The real cost is that you have to think a bit more about your data before you publish it. But this isn’t new thinking you wouldn’t have done otherwise; it’s the sort of thinking that professionals do when they think about their data in the first place.
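
Modeled out in Turtle, the flight, ticket and offer concepts might look roughly like this. Every class name, property name, identifier, time and price below is made up for illustration; the point is only that two passengers on the same flight can hold tickets bought under different offers.

```turtle
@prefix : <http://example.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# A flight is a thing with an origin, a destination and a departure time.
:Flight123 a :Flight ;
    :origin :NYC ;
    :destination :SFO ;
    :departure "2023-03-01T08:00:00"^^xsd:dateTime .

# Two tickets for seats on that same flight.
:Ticket1 a :Ticket ; :forFlight :Flight123 .
:Ticket2 a :Ticket ; :forFlight :Flight123 .

# The price belongs to the offer, not to the route or the flight.
:Offer1 a :Offer ; :forTicket :Ticket1 ; :priceInUSD "350.00"^^xsd:decimal .
:Offer2 a :Offer ; :forTicket :Ticket2 ; :priceInUSD "420.00"^^xsd:decimal .
```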

Can RDF-Star fix it?

Now let’s address the second point; will RDF-Star make this situation better? Suppose we were to provide the capability to add more triples to describe a statement like

:NYC :connection :SFO .

We could annotate this statement with a cost and a date, but since flights happen every day (and for those two destinations, probably many a day; let’s say it’s five a day), we’d have to figure out for each day, five costs and five date/times, which one goes with which. So we still have a reification problem like the ones we explored with tables; we have to figure out which one goes with which.
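
To make the pairing problem concrete, here is a sketch with just two flights, written in the Turtle-star syntax from the RDF-star community group report (the syntax in the final Recommendation may differ). The property names and all values are invented for illustration.

```turtle
@prefix : <http://example.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

:NYC :connection :SFO .

# Annotating the statement with two prices and two departure times.
# Nothing in the data says which price belongs to which departure.
<< :NYC :connection :SFO >>
    :priceInUSD "350.00"^^xsd:decimal , "420.00"^^xsd:decimal ;
    :departure "2023-03-01T08:00:00"^^xsd:dateTime ,
               "2023-03-01T13:30:00"^^xsd:dateTime .
```

All four annotations attach to the same statement, so the link between each price and its departure is lost.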

We could solve this by having multiple triples, e.g.,

:NYC :connection :SFO .
:NYC :connection :SFO .
:NYC :connection :SFO .

each with its own date/time, price, flight number, crew list, equipment, etc. But now, these statements, one per flight, are looking an awful lot like what we normally refer to as a “flight”; far better modeling practice would be to actually model it as a thing called a flight, and describe it in everyday terms. Which is what we were doing without RDF-Star. Using RDF-Star simply encouraged poor modeling practice, of ignoring the everyday name for something (a “flight”) and using a statement in its place. And when I say “everyday name”, for those of you who do enterprise data modeling, you should read “business terminology”, since in an enterprise setting, these terms are the terms that the business uses (who do you think taught me a word like “offer”? A data modeler? Or an investment banker? Or William Shatner?).

We can make similar statements about the doctor’s visit example; but here we have the advantage that Michael has already worked this example out in detail. And he did it all without using RDF-Star.

But RDF-Star really is cool

I hope that my provocative title didn’t alienate any of my friends and respected colleagues who are working diligently on the RDF-Star committee even as we speak to finalize that Recommendation. I actually do think that RDF-Star is pretty cool, and is going to make a big difference in the RDF world. I just don’t want to see it misused, as a band-aid to cover up poor modeling practice.

So what is a good use of RDF-Star, and how can you tell which is which? This is actually easier to sort out than it sounds. A triple in RDF is a logical statement about a relationship between two things; it can be either true or false. In plain old RDF, a dataset is a collection of such statements, and the whole thing is an assertion that these things are true. You can draw conclusions based on these things. The RDF standard doesn’t give you any advice about whether you should trust a particular statement, or how much confidence you should have in it, if any at all.

With RDF-Star, we add in the ability to make statements about statements. What sorts of things do we have to say about statements? We could tell “who claims this statement is true?” or “at what point in time was this statement true?” or even “when did we learn about this statement?”. In all the examples in the first part of this blog, the statements we made weren’t about the statement; the price statement about the connection from NYC to SFO wasn’t about the statement, it was about the flight. Statements about medical procedures during a visit aren’t statements about the statement “John Doe visited Dr. Smith”, but about the visit itself. We abused RDF-Star because we used a facility for making statements about statements to instead make statements about doctor visits or airline flights. The answer to the question, “when should I use RDF-Star and when shouldn’t I?” is actually right in the RDF-Star working group charter; it is for making statements about statements, and that’s all. It’s not for making statements about things in your domain, like flights, doctor visits, procedures, or costs.

So RDF-Star really is important, because it lets us address a lot of things about data usage that aren’t addressed in most data systems, including graph stores, document stores and relational stores, at all. If we want to know whether we want to trust a datum, a good way to figure that out is to look at its source. RDF-Star will let us do this directly. If we have some confidence in our data (or lack of confidence), we can annotate statements with some measure of confidence; maybe a probability estimate or some fuzzy measure. Temporal and bi-temporal reasoning is important in many situations; RDF-Star provides a way we can represent time scales on statements, in a standard, interoperable way. This opens the door for RDF database vendors to provide standards-based approaches to things like temporality and bi-temporality.
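
As a sketch of what such annotations could look like, here is a statement from the doctor example, that Nurse Wilson works for Dr. Smith, annotated with a source, a confidence and two kinds of dates. It uses the Turtle-star syntax from the RDF-star community group report; the annotation properties (:statedBy, :confidence, :validFrom, :recordedOn) and all values are assumptions made up for this example, not part of any standard vocabulary.

```turtle
@prefix : <http://example.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

:NurseWilson :worksFor :DrSmith .

# Each annotation is about the statement, not about the nurse or the doctor.
<< :NurseWilson :worksFor :DrSmith >>
    :statedBy :ClinicHRSystem ;                # who claims it
    :confidence "0.95"^^xsd:decimal ;          # how much we trust it
    :validFrom "2021-06-01"^^xsd:date ;        # when it became true
    :recordedOn "2021-06-03"^^xsd:date .       # when we learned it
```

The last two properties together are the bi-temporal case: one date for when the statement holds in the world, another for when it entered the data.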

There are probably more creative ways to appropriately use RDF-Star; one that comes to mind when modeling policy is jurisdiction: under what jurisdictions is this policy statement valid? For data governance, who is allowed to know the content of this statement? I bet there are a lot more; I invite my readers to come up with more. The only caveat is that the example is of a statement about a statement, not a statement about something else.

So there’s actually a lot to be excited about, after all.

¹ In the example I show here, Michael uses three triples, but that’s because I’ve already mentioned Nurse Wilson, which the simple, obvious representation does not.


Dean Allemang

Mathematician/computer scientist, my passion is sharing data on a massive scale. Author of Semantic Web for the Working Ontologist.