Converting Graphs to Tables

Dean Allemang
7 min readApr 28, 2023

In one of my blogs a few months ago, I showed a common way to wedge multiple values into a table. Here is an example of what I talked about in that blog:

A table where we put multiple values into a single cell

I talked about all the issues with representations of this sort in that blog, and how this is more naturally represented as a graph. But here, I want to use this as an example of how to do some things in SPARQL. Let’s start by supposing that you already have this data in a graph, and you wanted to create a report that looks like the figure above. You might well ask why on earth someone would want to create such a representation, given all the issues that this entails. But if you think of this as a report, rather than as a data representation (e.g., in an ETL or even ELT context), then this isn’t quite so weird; for a human being to read, this sort of table is fairly natural.

I have a different example to work on for this, that’s a bit smaller. Here’s a bit of TTL to show the example:

prefix : <http://www.example.org/multi#>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>

:EntryClack :inIndustry :Telecom, :HealthCare, :Insurance .
:EntryClack :inCollection :Permits, :Registrations .
:EntryClack :everythingElse "clack" .

:EntryLorry :inIndustry :Telecom .
:EntryLorry :inCollection :Registrations .
:EntryLorry :everythingElse "lorry" .


:Telecom rdfs:label "Telecom" .
:HealthCare rdfs:label "Health Care" .
:Insurance rdfs:label "Insurance" .
:Registrations rdfs:label "Registrations" .
:Permits rdfs:label "Permits" .

Or, if you prefer to see this drawn out, here it is (obligatory ChatGPT; this cytoscape diagram was generated by ChatGPT)

Graphic view of the same data in TTL above.

The interpretation of these links isn’t really important, but it could be for example two entries in a data catalog, which are classified by industry, and which are put into various display collections.

I’m using commas to slam multiple values into those cells in this example; you could use whatever you like.

How do you do this in SPARQL? There is an obvious answer, where you get all the values from the item you are querying for, grouping by the item, and concatenating the things you want (query available here):

SELECT (GROUP_CONCAT(?collection; SEPARATOR = ",") AS ?collection_list)
(GROUP_CONCAT(?industry; SEPARATOR = ",") AS ?industry_list)
?everythingelse

WHERE {
?row :everythingElse ?everythingelse.

?row :inIndustry ?i.
?i rdfs:label ?industry.

?row :inCollection ?c.
?c rdfs:label ?collection
}
GROUP BY ?everythingelse

This looks pretty good; you get all the information you need, group by the row, and concat on the things you want in a list. But look at the output:

Right answers, but too many duplicates.

We have all the correct values, but lots of duplicates.

So, we can try a DISTINCT to fix this. Obviously, a DISTINCT at the top level won’t work; those two rows are already distinct. Maybe inside somewhere?

PREFIX : <http://www.example.org/multi#>

SELECT (GROUP_CONCAT(?collection; SEPARATOR = ",") AS ?collection_list)
(GROUP_CONCAT(?industry; SEPARATOR = ",") AS ?industry_list)
?everythingelse

WHERE {
SELECT DISTINCT *
WHERE
{
?row :everythingElse ?everythingelse.

?row :inIndustry ?i.
?i rdfs:label ?industry.

?row :inCollection ?c.
?c rdfs:label ?collection
}
}
GROUP BY ?everythingelse

But this yields the same output.

What’s going on here? The problem is that, even inside the DISTINCT block, we are doing a cross product of ?industry and ?collection. There are six valid responses to that, each one distinct:

So putting a DISTINCT here won’t help; these are six, valid distinct responses to this query.

For every complex problem, there is an answer that is clear, simple, and wrong. (H. L. Mencken)

The problem is that we need the DISTINCT clauses to act independently on each variable. It turns out that there is a way to do that in SPARQL:

PREFIX : <http://www.example.org/multi#>

SELECT (GROUP_CONCAT(DISTINCT ?collection; SEPARATOR = ",") AS ?collection_list)
(GROUP_CONCAT(DISTINCT ?industry; SEPARATOR = ",") AS ?industry_list)
?everythingelse

WHERE {
SELECT DISTINCT *
WHERE
{
?row :everythingElse ?everythingelse.

?row :inIndustry ?i.
?i rdfs:label ?industry.

?row :inCollection ?c.
?c rdfs:label ?collection
}
}
GROUP BY ?everythingelse

I learned that from ChatGPT when I gave it this problem; I wasn’t aware that DISTINCT was available in that context. It makes sense that it should be; this is why.

But let’s keep going. I once had a boss who claimed that if your query needed a DISTINCT, then you hadn’t factored something correctly; either your schema or the rest of your query. Let’s take that maxim at face value, and see how to do this without DISTINCT. First we notice that we want to compute the industry list separate for each row, without looking at anything else; not “everything else”, not collection, just industry. If we do that, we get the answer we want. For industry, anyway.

PREFIX : <http://www.example.org/multi#>

SELECT ?row (GROUP_CONCAT(?industry; SEPARATOR = ",") AS ?industry_list)
WHERE
{
?row :inIndustry ?i.
?i rdfs:label ?industry.
}
GROUP BY ?row

Results:

There aren’t any duplicates here, since the industry is defined by a simple predicate. This looks like what my boss was saying; when I really ask for what I want, I don’t need DISTINCT (it wouldn’t make any difference here).

Do the same for Collection:

PREFIX : <http://www.example.org/multi#>

SELECT ?row (GROUP_CONCAT(?collection; SEPARATOR = ",") AS ?collection_list)
WHERE
{
?row :inCollection ?c.
?c rdfs:label ?collection.
}
GROUP BY ?row

Now we can simply combine all of these, to get the desired result:

prefix : <http://www.example.org/multi#>

SELECT ?collection_list
?industry_list
?everythingelse

WHERE {
?row :everythingElse ?everythingelse .
{
SELECT ?row (GROUP_CONCAT (?industry;separator=",") AS ?industry_list)
WHERE
{
?row :inIndustry ?i .
?i rdfs:label ?industry .
}
GROUP BY ?row
}
{
SELECT ?row (GROUP_CONCAT (?collection;separator=",") AS ?collection_list)
WHERE
{
?row :inCollection ?c .
?c rdfs:label ?collection .
}
GROUP BY ?row
}
}

The SPARQL syntax makes this pretty long; we have to put subqueries into {braces}, and do a separate subquery for each facet we are concatenating over. But when we’re all done, we get the desired answer:

How does this compare to the solution with a bunch of DISTINCTs in it? Both solutions require the GROUP_CONCAT, so that’s the same. The subquery solution requires all that subquery syntax, which is a bit wordy. But it allows you to separate out each group that you are aggregating over; in some sense, it is more explicit. And if you wanted to do something different on one of these than you do in the other, you’d have a place to do it.

You can of course repeat this for as many properties as you please, with another subquery for each one.

Just for fun, I wondered how this would be done in SQL. First off, to do this in SQL, you need to model this situation in such a way that you can do it correctly as a RDB. Here’s a nomnoml diagram of a data model that reflects this data (built by ChatGPT, of course).

A NomNoml diagram of an RDB schema that matches this data.

Since I’m not a SQL expert, I didn’t really want to consult myself on how to do this. And just about any human SQL expert is likely to say to me, “once you have your data factored correctly like in the figure, why do you want to un-factor it into a non-standard form?” Fortunately, ChatGPT isn’t in the habit of challenging human users when we ask it questions like this, so it gave me an answer. Two, actually. After going through the discussion in this blog (I asked it for SPARQL queries, too), it gave me the following answer:

SELECT 
m.id AS entry,
GROUP_CONCAT(DISTINCT i.name SEPARATOR '/') AS Industry,
GROUP_CONCAT(DISTINCT c.name SEPARATOR '/') AS Collection,
m.EverythingElse
FROM MAIN m
LEFT JOIN InJoin ij ON m.id = ij.main_id
LEFT JOIN Industry i ON ij.industry_id = i.id
LEFT JOIN CollJoin cj ON m.id = cj.main_id
LEFT JOIN Collection c ON cj.collection_id = c.id
GROUP BY m.id, m.EverythingElse;

This is pretty much the same answer that it gave for SPARQL, putting the DISTINCT modifiers into each of the GROUP_CONCAT clauses separately. When I asked it to figure out the query from scratch (not having had the discussion about ways to deal with duplicates), it gave me this much longer answer:

WITH IndustryAgg AS (
SELECT
InJoin.main_id,
STRING_AGG(Industry.name, '/') AS Industries
FROM InJoin
JOIN Industry ON InJoin.industry_id = Industry.id
GROUP BY InJoin.main_id
),
CollectionAgg AS (
SELECT
CollJoin.main_id,
STRING_AGG(Collection.name, '/') AS Collections
FROM CollJoin
JOIN Collection ON CollJoin.collection_id = Collection.id
GROUP BY CollJoin.main_id
)
SELECT
MAIN.id,
MAIN.EverythingElse,
COALESCE(IndustryAgg.Industries, '') AS Industry,
COALESCE(CollectionAgg.Collections, '') AS Collection
FROM MAIN
LEFT JOIN IndustryAgg ON MAIN.id = IndustryAgg.main_id
LEFT JOIN CollectionAgg ON MAIN.id = CollectionAgg.main_id;

This looks a lot like the SQL version of the SPARQL query I wrote, where we separately aggregate each dimension, then report them all together at the end. Just like my SPARQL version, no DISTINCTs were used in this solution.

--

--

Dean Allemang

Mathematician/computer scientist, my passion is sharing data on a massive scale. Author of Semantic Web for the Working Ontologist.