Loading BBC Good Food Data into Virtuoso

Daniel Heward-Mills
OpenLink Virtuoso Weblog
Mar 19, 2019
Image Source: BBC Good Food

Overview

Virtuoso’s Sponger Middleware layer generates Linked Data from a variety of web content formats (HTML, iCal, RDF documents and statements, etc.).

Objective

This article demonstrates how Virtuoso may be used to extract, transform, and load (ETL) recipe data from BBC Good Food.

The results of this tutorial can be seen by clicking the live links provided throughout.

Requirements

Observing Linked Data Embedded in Recipe Pages

[1] Open a recipe web page from BBC Good Food (example)

[2] If you have installed OSDS, the OpenLink Structured Data Sniffer, in your browser (Chrome, Opera, Firefox), just click on the Sniffer logo in your browser’s toolbar to view a human-friendly version of structured data embedded within the web page. Similar data will be extracted from each recipe web page.

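[3] The recipes linked from each cuisine collection page can be loaded into your Virtuoso instance by executing the following SPARQL query through the built-in SPARQL interface at <http://{cname[:port]}/sparql>. The DEFINE pragmas at the top instruct the Sponger to fetch each FROM document, and then to also fetch every page bound to ?recipePage, one level deep: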

DEFINE get:soft "replace"
DEFINE input:grab-var "?recipePage"
DEFINE input:grab-depth 1
SELECT ?recipePage
FROM <https://www.bbcgoodfood.com/recipes/collection/american>
FROM <https://www.bbcgoodfood.com/recipes/collection/british>
FROM <https://www.bbcgoodfood.com/recipes/collection/caribbean>
FROM <https://www.bbcgoodfood.com/recipes/collection/chinese>
FROM <https://www.bbcgoodfood.com/recipes/collection/french>
FROM <https://www.bbcgoodfood.com/recipes/collection/greek>
FROM <https://www.bbcgoodfood.com/recipes/collection/indian>
FROM <https://www.bbcgoodfood.com/recipes/collection/italian>
FROM <https://www.bbcgoodfood.com/recipes/collection/japanese>
FROM <https://www.bbcgoodfood.com/recipes/collection/mediterranean>
FROM <https://www.bbcgoodfood.com/recipes/collection/mexican>
FROM <https://www.bbcgoodfood.com/recipes/collection/moroccan>
FROM <https://www.bbcgoodfood.com/recipes/collection/spanish>
FROM <https://www.bbcgoodfood.com/recipes/collection/thai>
FROM <https://www.bbcgoodfood.com/recipes/collection/turkish>
FROM <https://www.bbcgoodfood.com/recipes/collection/vietnamese>
WHERE
  {
    ?s sioc:links_to ?recipePage .
    FILTER
      (  CONTAINS(str(?recipePage), "/recipes/")
      && !CONTAINS(str(?recipePage), "/category")
      && !CONTAINS(str(?recipePage), "/collection")
      && isIRI(?recipePage)
      )
  }
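Because the FROM clauses differ only in the cuisine name, the query text can also be generated rather than typed by hand. The Python sketch below is one hypothetical way to do that. The function names `build_sponge_query` and `endpoint_url`, and the `http://localhost:8890/sparql` endpoint used in the note that follows, are illustrative assumptions and not part of Virtuoso. The code only builds strings and does not contact any server.

```python
from urllib.parse import urlencode

# Base IRI shared by every collection page listed in the query above.
COLLECTION_BASE = "https://www.bbcgoodfood.com/recipes/collection/"


def build_sponge_query(cuisines):
    """Return the crawl query with one FROM clause per cuisine name."""
    from_lines = "\n".join(f"FROM <{COLLECTION_BASE}{c}>" for c in cuisines)
    return (
        'DEFINE get:soft "replace"\n'
        'DEFINE input:grab-var "?recipePage"\n'
        "DEFINE input:grab-depth 1\n"
        "SELECT ?recipePage\n"
        f"{from_lines}\n"
        "WHERE\n"
        "  {\n"
        "    ?s sioc:links_to ?recipePage .\n"
        "    FILTER\n"
        '      (  CONTAINS(str(?recipePage), "/recipes/")\n'
        '      && !CONTAINS(str(?recipePage), "/category")\n'
        '      && !CONTAINS(str(?recipePage), "/collection")\n'
        "      && isIRI(?recipePage)\n"
        "      )\n"
        "  }"
    )


def endpoint_url(endpoint, query):
    """Build a GET URL that passes the query in the `query` parameter."""
    return endpoint + "?" + urlencode({"query": query})
```

For example, `endpoint_url("http://localhost:8890/sparql", build_sponge_query(["thai", "greek"]))` yields a URL that carries a two-collection version of the query. Substitute your own host and port.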

If you’re using a SQL environment (isql, ODBC, JDBC, etc.), the same query can be executed using Virtuoso’s SPARQL-Within-SQL (SPASQL) functionality simply by (1) adding the keyword SPARQL to the start of the query, and (2) adding a semicolon to the end:

SPARQL
DEFINE get:soft "replace"
DEFINE input:grab-var "?recipePage"
DEFINE input:grab-depth 1
SELECT ?recipePage
FROM <https://www.bbcgoodfood.com/recipes/collection/american>
FROM <https://www.bbcgoodfood.com/recipes/collection/british>
FROM <https://www.bbcgoodfood.com/recipes/collection/caribbean>
FROM <https://www.bbcgoodfood.com/recipes/collection/chinese>
FROM <https://www.bbcgoodfood.com/recipes/collection/french>
FROM <https://www.bbcgoodfood.com/recipes/collection/greek>
FROM <https://www.bbcgoodfood.com/recipes/collection/indian>
FROM <https://www.bbcgoodfood.com/recipes/collection/italian>
FROM <https://www.bbcgoodfood.com/recipes/collection/japanese>
FROM <https://www.bbcgoodfood.com/recipes/collection/mediterranean>
FROM <https://www.bbcgoodfood.com/recipes/collection/mexican>
FROM <https://www.bbcgoodfood.com/recipes/collection/moroccan>
FROM <https://www.bbcgoodfood.com/recipes/collection/spanish>
FROM <https://www.bbcgoodfood.com/recipes/collection/thai>
FROM <https://www.bbcgoodfood.com/recipes/collection/turkish>
FROM <https://www.bbcgoodfood.com/recipes/collection/vietnamese>
WHERE
  {
    ?s sioc:links_to ?recipePage .
    FILTER
      (  CONTAINS(str(?recipePage), "/recipes/")
      && !CONTAINS(str(?recipePage), "/category")
      && !CONTAINS(str(?recipePage), "/collection")
      && isIRI(?recipePage)
      )
  } ;
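The two-step conversion described above (prefix the SPARQL keyword, append a semicolon) is mechanical, so it can be scripted when you keep queries in files and want both forms. A minimal Python sketch follows; `to_spasql` is a hypothetical helper name, not a Virtuoso function.

```python
def to_spasql(sparql_query: str) -> str:
    """Wrap a SPARQL query for a SQL client: SPARQL keyword first, ';' last."""
    # Drop surrounding whitespace and any semicolon already present,
    # so that wrapping the same text twice does not double the terminator.
    body = sparql_query.strip().rstrip(";").rstrip()
    return f"SPARQL\n{body} ;"
```

The returned string is what you would paste into isql or send over ODBC/JDBC.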

[4] Whether it is run through SPARQL or SQL, the query extracts the embedded metadata from each collection landing page, and from the recipe pages those landing pages link to (roughly 300 in total).
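To see which linked pages the FILTER in the query lets through, the same three substring tests can be mirrored in Python. This is only an illustration of the filter logic, not how Virtuoso evaluates it. `is_recipe_page` is a hypothetical name, and the sample URLs in the comments are made-up placeholders. The `isIRI` test has no plain-string equivalent and is omitted here.

```python
def is_recipe_page(url: str) -> bool:
    """Mirror the substring tests of the query's FILTER on a URL string."""
    return (
        "/recipes/" in url              # must sit under /recipes/
        and "/category" not in url      # skip category index pages
        and "/collection" not in url    # skip collection landing pages
    )


# Placeholder URLs, for illustration only:
# is_recipe_page("https://www.bbcgoodfood.com/recipes/example-recipe")   is True
# is_recipe_page("https://www.bbcgoodfood.com/recipes/collection/thai")  is False
```

The collection landing pages themselves are rejected by the third test, which is why only individual recipe pages are followed.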

[5] Once loaded, the recipes can be viewed through a SPARQL query like this:

SELECT ?recipe
       ?name
       ?source
WHERE
  {
    ?recipe a                schema:Recipe ;
            schema:name      ?name ;
            wdrs:describedby ?source .
    FILTER
      (
        CONTAINS(str(?source), "bbcgoodfood")
      )
  }
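If you request the results of this query in the W3C SPARQL Query Results JSON format, each row arrives as a "binding" object keyed by variable name. The Python sketch below flattens such a response into plain dictionaries. `rows_from_sparql_json` is a hypothetical helper, and `SAMPLE` is an invented one-row payload with placeholder values, not real output from the query.

```python
import json


def rows_from_sparql_json(text):
    """Turn a SPARQL JSON results document into a list of {var: value} dicts."""
    data = json.loads(text)
    variables = data["head"]["vars"]
    return [
        # A variable may be unbound in a given row, so check before reading it.
        {v: binding[v]["value"] for v in variables if v in binding}
        for binding in data["results"]["bindings"]
    ]


# Invented sample payload (placeholder IRIs and name), for illustration only.
SAMPLE = json.dumps({
    "head": {"vars": ["recipe", "name", "source"]},
    "results": {"bindings": [{
        "recipe": {"type": "uri", "value": "http://example.org/recipe/1"},
        "name": {"type": "literal", "value": "Example recipe"},
        "source": {"type": "uri",
                   "value": "https://www.bbcgoodfood.com/recipes/example-recipe"},
    }]},
})

for row in rows_from_sparql_json(SAMPLE):
    print(row["name"], row["source"])
```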

[6] We can also view a visual representation of the extracted recipe metadata through Virtuoso’s built-in Faceted Browsing interface, by clicking on the hyperlinks in the recipe column of the SPARQL results page.

[7] Finally, you can use the same query to generate a PivotViewer Report that adds image processing and an animated drill-down to the experience, as illustrated in the screenshots that follow:

Default Page — Live Query
Filter by Recipe Categories (e.g., Lunch and Dinner) — Live Query
Filter by Recipe Calories

Next Steps

[8] Now that the data has been loaded, you can begin querying the newly loaded recipes in depth [Article to be released later this week].

Bonus

I’ll repeat this to extract the entire BBC Good Food recipe set, and share the script if this post receives 100 claps 👏🏾
