From Zero to SPARQL with no installs

Dean Allemang
13 min read · Feb 14, 2023

I was at a conference a while back, and someone said that they had trouble getting started with RDF. They had downloaded the community version of Neo4j and got started right away. Rather than ask them why they didn’t download the open source version of Jena, or a free evaluation version of Virtuoso or GraphDB, or the Free Edition of AllegroGraph, or even just visit the open SPARQL endpoint for DBpedia, I want to describe a no-download way for anyone to start querying their own data as RDF. No-download is important if you work in a locked-down setting where you can’t install software yourself (I used to work for a bank, so I have a lot of familiarity with this). Your own data is important because, well, you know your way around that data, so you have some idea what value it might have. The best way I know of to do this is at data.world.

Even if you are already a SPARQL or RDF fan, and you already have a Fuseki endpoint running on your AWS cloud instance, and you already have TopBraid Composer on your desktop, you can still find data.world pretty useful. You can share your datasets and queries with your buddies on the web, as I have done with the Working Ontologist website. And you don’t have to keep maintaining that AWS service.

I’m going to assume you don’t have a data.world account yet, and walk you through getting started with RDF. You won’t download anything and you won’t install anything. You will need an account on one of Google, Facebook, GitHub, or Twitter (or you can create a new account on data.world with just your email address).

We’ll start at the data.world front page, and click “sign in”.

banner that shows data.world and its header menus; the important one here is the “sign in” button.
data.world welcome banner

Then pick one of the identity services: Google, Facebook, GitHub, or Twitter. In this example, I’m going to use Twitter.

data.world login page.
You can sign in to data.world using your Google, Facebook, GitHub, or Twitter account.

data.world will ask you to log in to that service, and tell you what information data.world will need to access.

screenshot from data.world requesting access
data.world needs some limited information from your account.

Now sign in to your service.

twitter sign-in page
Twitter gives you a bunch of sign-in options, too.

This will bring you to your own data.world front page. You’ll even have your picture in the corner, if you had a picture in the service you logged in from. Your picture won’t be as cute as mine, unless you also have a picture of yourself posing with a koala.

screenshot of data.world landing page
data.world welcome page, once you’ve logged in. Koala not included.

What happens next? You could search data.world’s 100K+ public datasets for something of interest, or you could add something of your own.

A koala who is not downloading anything

For this example, I’m going to use some data from the US Postal Service. If the data you are using belongs to you (i.e., you created it), you can go ahead and use that. But there is a lot of public data out there that could be of use to a lot of people (including you!); you can do the world a service by bringing it into data.world.

In the upper right of your web page, right next to your picture without a koala, there’s a button that says “+New”. Click that. You have the choice of creating a dataset or a project. Select “Dataset”. You’ll be asked to name your dataset and to say whether you want to share it with no one or with the whole community. Live a little — share it with us!

Screenshot of data collection interface for data.world
Give your dataset an evocative name that you’ll remember.

When you create the dataset, you’ll be asked to give it a description (please do future you a favor and help them remember what you are doing today; this also helps the rest of us find your data). You could add data right now, but I’m going to ask you to wait a minute before you do that; just hit “Continue”.

Screenshot of dataset metadata page
Describe your dataset. Future you will thank present you.

Now you are at the front page of your new dataset. You see the name you gave it, and the description. And see, I told you so, you have the opportunity to add data, just like you did earlier. But I put a Big Red Circle on this screenshot; this is where you edit some metadata:

Screenshot of dataset resource page on data.world
Before you load data, edit the metadata.

Why am I having you edit metadata before I have you upload data? Because I am going to have you include the licensing of the data you are uploading. This is very important; technologically, you can upload data you have no right to re-publish, but that doesn’t make your data FAIR (the “R” in FAIR means you are allowed to re-use it). So select a license. If you don’t know that your data is available for re-use, stop here and find some data that is. Or create some data yourself, so you have the right to say what license governs it.

Screenshot of dataset metadata page, selecting open data license type.
Only upload data that you have the right to upload.

In this case, the data comes from the US government, and is the subject of a FOIA request, and so by law is in the Public Domain. So select that license. You can edit your description as well at this point, if you feel you have more to say.

Now, let’s get to the data. In my case, I’m picking up a spreadsheet from a website. You might have a spreadsheet of your own, or maybe you have an RDF file (in just about any serialization you want), or maybe it’s JSON or XML. The most common starting point is some sort of spreadsheet, so that’s what I’ll do.

There’s a lot of interesting data on the US Post Office FOIA site. I’m going to look at Change of Address Stats. These files are available as CSV, so I could download one to my hard drive, and then drag-and-drop it to that big blue “Add data” button. If you have your own data, that’s what you’ll want to do.

But I’m going to go one better. Remember, I said you wouldn’t have to download anything; that includes the data. I’m going to click that “Add data” button, choose “sync from URL”, and then paste in the URL of the CSV I want to use.

screenshot of data.world data upload wizard
You can drag-and-drop a file, or sync data from a URL

Since the US Post Office FOIA library is a public domain web site with no authentication, I don’t have to put in authentication stuff. Then I hit “Continue”.

Give your file a name on data.world (not necessarily the same as the one it came with)

The pop-up shows the original name of the file (a rather undescriptive “Y2023.csv”); I spruced that up a bit so future me won’t scratch his head wondering what I did here. I also select “Autosync weekly”; after all, this is the 2023 data, and I hope the Post Office updates it fairly regularly, so I will have data.world keep up with the latest.

data.world provides a preview of your data

After about a slow count to ten, I now have this CSV loaded on data.world. You can see a preview of it on the dataset front page. I invite you to click that little “Switch to column overview” to learn some interesting stats about this data right off the bat, but this blog is already too long, so I won’t go into those details here.

Instead, I’m going to start writing queries over this data. Suppose we want to get a profile of people coming to New York City, broken down by month. I have included the 2022 data as well (doing the same sequence of operations I did for 2023), so we have something to go on. Start by clicking “Launch Workspace” in the upper right of the dataset screen. This is where we can start writing queries, looking at our data, etc. We’re going to use SPARQL to query this data (you can use SQL if you like, but do that on your own blog!). Select “+Add” near the upper left, and then “SPARQL Query”.

Create a SPARQL query

What is something we might want to ask about this data set? It would help to know a bit about what the data covers. We’re in luck here, because the post office has included an explanation of the fields. You can read that, but if you don’t want to click through: basically, for each zip code in the US, for each month in the given year, it provides a count of how many change of address forms were filed to leave that zip code, and how many to come to that zip code, broken down in a number of ways.

We can of course filter down the data to some subset. Suppose we want to just find the total number of people who went in and out of each zip code in New York City, per month. You could use some of the methods for exploring an RDF dataset that you’ve never seen before, but for a spreadsheet like this, you can do a lot better. From the explanation the Post Office gave us, we know that the fields we want for this are called YYYYMM (for the month), zip code, city and state (to figure out which ones are in New York City), and two fields called Total Perm (one for requests coming in, the other for requests going out).
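(Incidentally, if you did want to explore blind, a generic probe like the one below works on any graph, because it doesn’t need to know a single IRI up front; the LIMIT just keeps the result small.)

SELECT ?row ?property ?value
WHERE {
  ?row ?property ?value .
}
LIMIT 25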

Since this data is in a CSV, you can see strange things like two fields with the same name. When data.world turns this into a graph, these two are given unique identifiers, using its naming conventions. So how do we know what these identifiers are? Fortunately, we don’t have to build them up using the naming conventions; data.world gives us a map of the spreadsheet to get us started.

The column overview is the start of an online data dictionary. It also knows the correspondence between column names (not unique) and IRIs (globally unique)

For each spreadsheet we’ve uploaded, we see all the fields that were in it (and in the case of duplicates, we see that data.world has appended a “_2” to one of them to make them distinct). Also, if you hover over one (like I did here for total_perm) you can just click on it to copy that IRI to your paste buffer. You don’t need to know how that IRI is built.

So we can build our filter query just by copying and pasting the properties we need:

SELECT ?date ?zip ?out ?in
WHERE {
  ?row <https://dinosterous.linked.data.world/d/post-office-data/col-change_of_address_y2022-zipcode> ?zip ;
       <https://dinosterous.linked.data.world/d/post-office-data/col-change_of_address_y2022-city> "NEW YORK" ;
       <https://dinosterous.linked.data.world/d/post-office-data/col-change_of_address_y2022-state> "NY" ;
       <https://dinosterous.linked.data.world/d/post-office-data/col-change_of_address_y2022-total_perm> ?out ;
       <https://dinosterous.linked.data.world/d/post-office-data/col-change_of_address_y2022-total_perm_2> ?in ;
       <https://dinosterous.linked.data.world/d/post-office-data/col-change_of_address_y2022-yyyymm> ?date .
}
ORDER BY ASC(?date) ASC(?zip)
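Those full property IRIs are a mouthful. If you prefer, you can declare a PREFIX once at the top of the query and abbreviate them; the label coa: below is my own choice, and any label will do.

# Same query as above, with a PREFIX declaration to keep the
# property IRIs readable.  The prefix label "coa:" is arbitrary.
PREFIX coa: <https://dinosterous.linked.data.world/d/post-office-data/col-change_of_address_y2022->

SELECT ?date ?zip ?out ?in
WHERE {
  ?row coa:zipcode      ?zip ;
       coa:city         "NEW YORK" ;
       coa:state        "NY" ;
       coa:total_perm   ?out ;
       coa:total_perm_2 ?in ;
       coa:yyyymm       ?date .
}
ORDER BY ASC(?date) ASC(?zip)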

We’re filtering to find records about New York City by using literals for those properties, in contrast to the variables we use for the others. So far, we’ve just echoed information that was in the table. How about a little analysis? Suppose we want to know how many zip codes gained residents vs. how many lost residents in each month? We can compute which way each zip code went, and add them up per month.

SELECT ?date (SUM(?plus) AS ?gain) (SUM(?minus) AS ?loss)
WHERE {
  ?row <https://dinosterous.linked.data.world/d/post-office-data/col-change_of_address_y2022-zipcode> ?zip ;
       <https://dinosterous.linked.data.world/d/post-office-data/col-change_of_address_y2022-city> "NEW YORK" ;
       <https://dinosterous.linked.data.world/d/post-office-data/col-change_of_address_y2022-state> "NY" ;
       <https://dinosterous.linked.data.world/d/post-office-data/col-change_of_address_y2022-total_perm> ?out ;
       <https://dinosterous.linked.data.world/d/post-office-data/col-change_of_address_y2022-total_perm_2> ?in ;
       <https://dinosterous.linked.data.world/d/post-office-data/col-change_of_address_y2022-yyyymm> ?date .

  BIND (IF(?in > ?out, 1, 0) AS ?plus)
  BIND (IF(?in < ?out, 1, 0) AS ?minus)
}
GROUP BY ?date
ORDER BY ASC(?date)

It’s pretty cool that we were able to import a spreadsheet and start querying it with SPARQL without downloading anything at all, but let’s face it, SPARQL is not going to shine when you’re just querying a single file that started off in tabular form. But as I pointed out in an earlier post, graphs really shine when you want to merge data. Let’s see how that works on data.world.

First off, let’s find some data we might want to merge with our post office data. Again, data.world comes to the rescue. Hunting around the community datasets on data.world, I discovered a dataset about bedbug infestations in New York City. Suppose we wanted to know, is there any correlation between bedbug infestations and arrivals/departures from a particular New York zip code?

This blog isn’t intended to be a tutorial for statistics (just what do I mean by “correlation”?) or writing elaborate SPARQL queries (how do I aggregate the number of infestations, while adding up the arrivals/departures and sort by zip code?). But it is a blog about data merging — so, given that we found a dataset among the more than a hundred thousand community datasets on data.world that might be relevant, how do we combine it with this dataset we just uploaded?

You might be tempted to download the bedbug data, then upload it again into data.world, alongside your post office data. But that’s only if you happen to enjoy loading data into databases. Or, since you haven’t downloaded anything yet and you don’t want to start now, you might want to create a link to the bedbug data alongside your post office data. But both data sets are already on data.world — can’t we just query them together?

Of course we can. data.world is a broadly scalable platform; it can manage hundreds of thousands of datasets. While you can’t query all of them at once (that’s too large a job even for data.world), you can query over any subset of them (performance degrades if your subset is very large). So let’s see how that goes.

We saw earlier how to start a new SPARQL query. Let’s make another one.

But in that same screen, we see all of the data we’ve uploaded. You’ll see that I’ve uploaded a few more post office files. For this example, the one we’re interested in is the 2019 data. So we’re going to use the federation feature of SPARQL, where we can name the sources we are querying over. In this case, we want to query over the 2019 file. So in data.world, we select the IRI of the graph that corresponds to that file, as shown here:

Starting a federated query

Why do we need an IRI, instead of just the filename? Remember that we are distributing over hundreds of thousands of files; we can’t expect all those filenames to be unique. The whole point of an IRI is to have a globally unique name for resources on the web. The Semantic Web (and by extension, data.world) uses the same method to identify data files.

What do we do with that? We paste it into a FROM clause in the SPARQL query (I’ve already done the paste in the image above). This says that the SPARQL query will use data from the 2019 post office data we uploaded.

How about the bedbug data? We can do the same thing there; we navigate to the “workspace” for the bedbug data, and copy the graph IRI of the data we are interested in.

Getting the globally unique identifier for a file on data.world

Now we paste that into our original query.

The simplest federated query, pointing to two data files from different users, subjects, and eras.
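In case the screenshot is hard to make out, the shape of that query is roughly the following. The two graph IRIs here are placeholders of my own, standing in for whatever you copied from each dataset’s workspace.

# Both FROM IRIs below are placeholders; paste in the graph IRIs you
# copied from the two workspaces.
SELECT ?s ?p ?o
FROM <https://dinosterous.linked.data.world/d/post-office-data/change_of_address_y2019.csv>
FROM <https://example-user.linked.data.world/d/nyc-bedbug-data/bedbug_reporting.csv>
WHERE {
  ?s ?p ?o .
}
LIMIT 10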

Notice what we’ve done here: we found some data that someone else uploaded a few years ago and combined it with our own data, simply by using the IRI for that file as a source in a SPARQL query. In terms of the Cool Things I do with RDF, we’ve superimposed the bedbug data on the post office data. The query going forward will work over the merger of these two data files.

I won’t leave you hanging about how we might query these two things together; there’s a query that starts off just like the figure: it cross-references zip codes and dates, adds up all the units that were infested with bedbugs in each zip code, computes the net gain/loss of people due to change of address, and reports those side by side (see the sketch below). The query has to sort out the fact that the two datasets represent zip codes differently (what’s that about?) and that one measures specific days whereas the other only measures months, but these issues are to be expected whenever you have hundreds of thousands of data stakeholders, each managing data according to their own ability, knowledge, and needs.
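To give a flavor of the cross-referencing part only, here is a sketch of the join. The bedbug property IRIs and the variables ?bbZip and ?infested are hypothetical placeholders (you would copy the real IRIs from that dataset’s column overview, just as we did for the post office columns), and I’m assuming the 2019 post office file follows the same column naming pattern as the 2022 file shown earlier. The date alignment and the aggregation are left out of the sketch.

# Sketch only: the bedbug property IRIs are invented placeholders, and the
# coa: prefix assumes the 2019 file is named like the 2022 file above.
# (The two FROM clauses from the figure above would go between SELECT and WHERE.)
PREFIX coa: <https://dinosterous.linked.data.world/d/post-office-data/col-change_of_address_y2019->

SELECT ?zip ?date ?in ?out ?infested
WHERE {
  ?move coa:zipcode      ?zip ;
        coa:yyyymm       ?date ;
        coa:total_perm   ?out ;
        coa:total_perm_2 ?in .

  ?report <https://example-user.linked.data.world/d/nyc-bedbug-data/col-bedbug_reporting-zip> ?bbZip ;
          <https://example-user.linked.data.world/d/nyc-bedbug-data/col-bedbug_reporting-infested_units> ?infested .

  # The two datasets represent zip codes differently; comparing their
  # string forms is one way to line them up.
  FILTER (STR(?zip) = STR(?bbZip))
}
ORDER BY ASC(?zip) ASC(?date)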

So there you have it; find some data on the web (or make some yourself), load it into data.world, and start querying. If you find something useful in just your own data, good for you. But once you start querying, feel free to look around data.world’s community. Who knows what you’ll find that might spark a new line of research. And who knows? Maybe next week, another data journalist will find your dataset, and get even more value from it.

Happy dataing!

PS. I was quite inspired by this study about people coming and going in New York vs. real estate prices. I’ll probably come back to this in more blogs. The friend who pointed me to this story commented, “I wonder if the same analysis holds for [his home town] Seattle?” The cited article refers to the Post Office data as “laptop-choking”, and indeed, if you load all of it into data.world you do exceed your free allotment of 10M. But as far as filtering goes, it isn’t SPARQL-choking at all! None of these queries takes more than a couple of seconds.

PPS. I did all the operations for this blog using a brand new account I made on data.world just for this blog, rather than my regular data.world account, which might have some privileges that you, the reader, might not have. So you should be able to do everything that I did here. Except cuddling a koala. You’ll have to organize that yourself.


Dean Allemang

Mathematician/computer scientist, my passion is sharing data on a massive scale. Author of Semantic Web for the Working Ontologist.