Constructing SPARQL Queries

From simple queries to federated queries and property paths

If you have not read any of my previous posts and are new to RDF and Linked Data you may want to check out some of my previous articles:

Why Use Linked Data to Tackle Big Data Problems
Understanding Linked Data Formats
Creating Linked Data with OpenRefine

In that last article we created a dataset which you can download if you wish to try out the queries we will construct here.

What is SPARQL?

SPARQL (SPARQL Protocol and RDF Query Language) is the query language we use to shape and return linked data from a triplestore.

SPARQL queries contain triple patterns, much like the data itself, which utilise the relationships to quickly navigate any linked data. This language is common for all linked data so queries can traverse across multiple RDF databases at once. Query 7 in the next section is an example of this powerful characteristic.

First Examples:

The easiest way to start learning SPARQL is by example so lets start with a simple query and build from it. The following query returns every triple in the triplestore:

SELECT ?subject ?predicate ?object
WHERE {
?subject ?predicate ?object .
}

This query selects all triples matching the pattern: ?subject ?predicate ?object which is all triples. The ? indicates a variable so ?example is a variable called “example”. These variables match every possible entity, predicate or literal that fit the patterns in the query.

In this query we are selecting every variable in the pattern so we can represent that with a * character. Also, variables can be called anything so the following query is almost exactly the same as the one above:

SELECT *
WHERE {
?s ?p ?o .
}
LIMIT 200

The only difference in results is that this query only returns 200 triples. We usually set limits in queries like these as there can be billions of matching triples! If you want the next batch you can add OFFSET 200 on the next line. This allows you to batch process results if needed.

Obviously to retrieve useful results you will need to fix variables. Fixing the predicate we can get twenty entities with their labels:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT *
WHERE {
?entity rdfs:label ?name .
}
LIMIT 20

Entities are connected to their human-readable names with the predicate rdfs:label so ?entity rdfs:label ?name matches any triple with the predicate rdfs:label and returns ?entity and ?name.

We have defined a prefix at the top of the query so that we can make the query more readable. Removing the top line and replacing rdfs:label with <http://www.w3.org/2000/01/rdf-schema#label> will return the same results. In larger queries you will notice the benefits of prefixing as the same prefix can be used multiple times in one query.

Olympics Examples:

Now that we have gone through the basics, lets move on to some more specific examples. Again, we will be running these on this data.

Query 1:

This query lists the names of every gold medallist (excluding duplicates):

PREFIX walls: <http://wallscope.co.uk/ontology/olympics/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?name
WHERE {
?instance walls:athlete ?athlete ;
walls:medal <http://wallscope.co.uk/resource/olympics/medal/Gold> .
?athlete rdfs:label ?name .
}

I begin by listing the prefixes used in the query to make it more readable. We are wanting the names of the athletes so I have selected o return the variable I call ?name . The DISTINCT clause after SELECT filters the names to remove duplicates. This is often required as an entity can fit a matching pattern multiple times (in this case an athlete that has won multiple gold medals).

Looking at the data we can see that athletes are linked to their medals through the instance entity which we therefore start from. The first triple we match is:

?instance walls:athlete ?athlete ;

This grabs every entity with an outgoing predicate walls:athlete and the athlete entity.

If you notice, the pattern is ended with a semi-colon rather than a full-stop like we have seen so far. This is because the next triple shares a subject allowing us to shorten:

WHERE {
?subject ?p1 ?o1 .
?subject ?p2 ?o2 .
?subject ?p3 ?o3 .
}

to (note the final full-stop)

WHERE {
?subject ?p1 ?o1 ;
?p2 ?o2 ;
?p3 ?o3 .
}

Thanks to this shorthand, the next line of our query connects the instance entities to their Gold medal entities. As these two patterns are in the same query, our result set is now reduced to every instance that connects an athlete to a gold medal.

We finally want the human readable names of the athlete entities so we get their labels and assign them the variable ?name as this is what we want to return.

We are therefore returning a list of human-readable names of athletes that got gold medals at the Olympics (with duplicates removed) as desired:

You can use prefixes to shorten entities, for readability, but it is not as common as they do not tend to be repeated as much as predicates in large queries. Query 1 could be shortened to this for example however:

PREFIX walls: <http://wallscope.co.uk/ontology/olympics/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX medal: <http://wallscope.co.uk/resource/olympics/medal/>
SELECT DISTINCT ?name
WHERE {
?instance walls:athlete ?athlete ;
walls:medal medal:Gold .
?athlete rdfs:label ?name .
}

Query 2:

This query lists the names of every athlete, with at least one medal, alongside their total number of medals (sorted by the number of medals):

PREFIX walls: <http://wallscope.co.uk/ontology/olympics/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?name (COUNT(?name) As ?noOfMedals)
WHERE {
?instance walls:athlete ?athlete ;
walls:medal ?medal .
  ?athlete  rdfs:label    ?name .
}
GROUP BY ?name
ORDER BY DESC(?noOfMedals)

Instead of fixing the medal to gold, we want to grab every athlete with any medal. We do this by replacing the gold medal URI to the variable ?medal .

If, for example, an athlete called James got a silver medal in 2000 and a bronze medal in 2004 he would be listed two times. In Query 1 we removed duplicates using DISTINCT but we want to count the number of medals and can utilise this duplication. In Query 2 we count the number of times ?name appears and call this new variable ?noOfMedals to be returned. This is done on the select line with:

(COUNT(?name) As ?noOfMedals)

We then GROUP BY ?name to combine the counts by name to get a list of athlete names alongside the number of times that name appeared in the list.

Finally we want to order this list by the number of medals each athlete won with the biggest number first. To order a list in descending order we add the line:

ORDER BY DESC(?noOfMedals)

This returns our desired list of athletes ordered by the number of medals they have won:

Query 3:

This query lists each country alongside the average height and weight of its athletes (sorted by the average height):

PREFIX walls: <http://wallscope.co.uk/ontology/olympics/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?countryCode
(AVG(?height) As ?avgHeight)
(AVG(?weight) As ?avgWeight)
WHERE {
?noc dbo:ground ?team ;
rdfs:label ?countryCode .

?athlete rdf:type foaf:Person ;
dbo:team ?team ;
dbo:height ?height ;
dbo:weight ?weight .
}
GROUP BY ?countryCode
ORDER BY DESC(?avgHeight)

If we take a look at an athlete entity, in this case Addis Abebe, we can see the predicates we used to link each athlete to their attributes:

We are focusing on the right hand panel for athlete attributes

COUNT is not the only operation that we can perform on variables, in this query we want to calculate the mean of our variables which we can do using AVG in the select line.

We grab each country with their attached teams and labels. Using the teams we find the athletes in those teams with their heights and weights. We group by the country codes so that all the athletes are divided into their respective countries. Using AVG in the select line we calculate the average heights and weights in these groups before assigning them to new variables to be returned. Finally we sort the results by the calculated average height with the tallest first. This returns the desired results:

Query 4:

This query lists each year alongside the oldest Judo competitor in that year:

PREFIX walls: <http://wallscope.co.uk/ontology/olympics/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dbp: <http://dbpedia.org/property/>
SELECT ?year (MAX(?age) As ?maxAge)
WHERE {
?instance walls:games ?games ;
walls:event ?event ;
walls:athlete ?athlete .

?event rdfs:subClassOf <http://wallscope.co.uk/resource/olympics/sport/Judo> .

?games dbp:year ?year .

?athlete foaf:age ?age .
}
GROUP BY ?year

As well as COUNT and AVG we can get the maximum number in a group of numbers using the operator MAX . In this query we first grab every instance with their associated Olympic games, event and athlete. With these we match only those instances that are linked to events that are classified as Judo events.

We then grab the years of the Olympic games and the ages of the athletes to gather all the information we need. We group all the results by their attached year with a GROUP BY and use the MAX operator in the select to return the max age in each year group.

This returns the desired result table, displaying each year alongside the age of the oldest Judo athlete competing in that year:

Query 5:

This query lists every athlete with “louis” in their name alongside the city and season that they competed in:

PREFIX walls: <http://wallscope.co.uk/ontology/olympics/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbp: <http://dbpedia.org/property/>
SELECT DISTINCT ?name ?cityName ?seasonName
WHERE {
?instance walls:games ?games ;
walls:athlete ?athlete .

?games dbp:location ?city ;
walls:season ?season .

?city rdfs:label ?cityName .

?season rdfs:label ?seasonName .

?athlete rdfs:label ?name .

Filter (REGEX(lcase(?name),”louis.*”))
}

This query uses regex to check the athletes names for “louis” so this query can be run on all triplestores. We can use the full text index functionality instead, specific to each triplestore, which I show an example of later.

Apart from this regex expression, nothing in this query is new. To get all the information we need however we need many patterns in the query (which is why I am showing this example).

We use the FILTER clause on a variable to narrow down the result set as required. In this case we run a simple regex expression on the ?name variable so that we only return athletes with “louis” in their name. Near the end of this article I will give another, simpler, FILTER example.

Once all the patterns are found and results filtered, this query returns every athlete with “louis” in their name alongside the city and season they competed in:

Query 6:

This query gets a list of every city that an athlete from Serbia and Montenegro competed in. It then counts the number of males and females that ever competed in one of those cities and return these counts:

PREFIX walls: <http://wallscope.co.uk/ontology/olympics/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dbp: <http://dbpedia.org/property/>
PREFIX noc: <http://wallscope.co.uk/resource/olympics/NOC/>
SELECT ?genderName (COUNT(?athlete) AS ?count)
WHERE {
?instance walls:games ?games ;
walls:athlete ?athlete .

?games dbp:location ?city .

?athlete foaf:gender ?gender .

?gender rdfs:label ?genderName .

{
SELECT DISTINCT ?city
WHERE {
?instance walls:games ?games ;
walls:athlete ?athlete .

?athlete dbo:team ?team .

noc:SCG dbo:ground ?team .

?games dbp:location ?city .
}
}
}
GROUP BY ?genderName

Again by design, this query is deliberately large and crosses the entire knowledge graph twice. This is to give an example of a subquery and to test triplestores in a future article.

Subqueries are evaluated from the innermost query to the outermost. Therefore in this example the city selection is resolved first, making the city variable available to the outer select query.

Beginning with SELECT DISTINCT ?city , this sub-select grabs every instance connecting an athlete to the games they played at. The team of each athlete is collected and matched to check if the teams grounds are located in Serbia and Montenegro (NOC code SCG). The current result-set at this point contains every athlete in a team from Serbia and Montenegro. To finish this inner query we finally grab every city that any of these athletes have ever competed in and remove duplicates.

This passes the ?city variable to the outer query with these results. This outer query grabs all Olympic games that were ever played in one of these cities. Using the instance entities, every athlete that played in these games are gathered with their attached genders.

These results are then grouped by gender and finally counted to return each gender with the number of athletes that competed in one of the returned cities:

Query 7:

This query lists every sport’s name that has a stored team size in DBpedia and return these sizes next to the corresponding sport.

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
SELECT ?sportName ?teamSize
WHERE {
?sport rdf:type dbo:Sport ;
rdfs:label ?sportName .

SERVICE <http://dbpedia.org/sparql> {
?dbsport rdfs:label ?sportName ;
dbo:teamSize ?teamSize .
}
}

Compared to the last query, this one is much simpler. We simply grab each sport with their names.

Using SERVICE we can traverse triplestores within a query. If you head to http://dbpedia.org/sparql you will find the DBpedia (linked data version of Wikipedia) endpoint which this query uses. The line:

SERVICE <http://dbpedia.org/sparql> {

connects to this endpoint and sends the inner query. Every sport with an attached teamSize in DBpedia is found that has a name in the ?sportName variable. These queries are sometimes run inner-most and sometimes outer-most first depending on whether the triplestore optimises these queries. This will be covered in my next article.

We therefore have each sports name with the team size of that sport from another triplestore:

Query Full Text Index:

This query lists every athlete with “louis” in their name alongside the city and season that they competed in. This query uses each triplestores full text index feature unlike Query 5.

PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
PREFIX walls: <http://wallscope.co.uk/ontology/olympics/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbp: <http://dbpedia.org/property/>
SELECT DISTINCT ?name ?cityName ?seasonName
WHERE {
?instance walls:games ?games ;
walls:athlete ?athlete .

?games dbp:location ?city ;
walls:season ?season .

?city rdfs:label ?cityName .

?season rdfs:label ?seasonName .

?athlete rdfs:label ?name .

?name luc:myIndex “*louis*” .
}

This query is exactly the same as Query 5 apart from the method it uses to filter the athletes names. Most triplestores have a full text index feature to speed up text queries like this. In the above example we are using GraphDB so the full text index predicate is luc:myIndex (you can see the luc prefix full form in line 1 of the query).

These, as optimised per triplestore, tend to be much faster than regex filters but have to be tailored to a specific triplestore. For example, running this query on the DBpedia endpoint above would return nothing as they are using Virtuoso.

Additional Features:

I feel that I did not cover FILTER enough so in this section will give one further filter example and cover Simple Property Paths.

Filtering:

Filtering, as the name suggest, helps filter your results to return exactly what you need. In Query 5 I used FILTER to run a regex expression over a group of string literals. Filters can be run simply on entities also, for example:

PREFIX walls: <http://wallscope.co.uk/ontology/olympics/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?name
WHERE {
?instance walls:athlete ?athlete ;
walls:medal ?medal .
 ?athlete rdfs:label ?name .
 FILTER(?medal = <http://wallscope.co.uk/resource/olympics/medal/Gold>)
}

This query is simply Query 1 using a filter instead of a fixed variable. I have exposed ?medal as a variable and then filtered this variable to only grab gold medal entities.

Filters can be used on numbers (FILTER(?length > 3)), dates and variable types (FILTER isLiteral (?entity)) so they are very powerful when needed.

Simple Property Paths:

You can probably see in many of the above queries, we do not at any point actually use assigned variables, we just name them to link to something we need. Simple Property Paths avoid this as we can use a / instead of a variable name. For example this query:

PREFIX walls: <http://wallscope.co.uk/ontology/olympics/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbp: <http://dbpedia.org/property/>
SELECT DISTINCT ?cityName
WHERE {
?instance walls:games ?games .

?games dbp:location ?city .

?city rdfs:label ?cityName .
}

can instead be written as:

PREFIX walls: <http://wallscope.co.uk/ontology/olympics/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbp: <http://dbpedia.org/property/>
SELECT DISTINCT ?cityName
WHERE {
?instance walls:games / dbp:location / rdfs:label ?cityName .
}

This is very useful if you need to run a query that traverses across a large knowledge graph.

Conclusion:

Hopefully by this point I have covered many common SPARQL features clearly enough. In my next article I will deploy many of the most common triplestores and run these queries to compare them fairly.