Loading BBC Good Food Data into Virtuoso

Image Source: BBC Good Food

Overview

Virtuoso’s Sponger middleware layer generates Linked Data from a wide variety of web content (e.g., HTML, iCal, and RDF documents and statements).

Objective

This article demonstrates how to use Virtuoso for extracting, transforming, and loading (ETL) recipe data from BBC Good Food.

The final product of this tutorial can also be viewed using the live links provided throughout.

Requirements

  • Virtuoso Universal Server (Commercial/Open Source)
  • Linked Data Cartridges VAD
  • Optional: OpenLink Structured Data Sniffer (Chrome, Firefox)

Observing Linked Data Embedded into Recipe Pages

  1. Open a recipe web page from BBC Good Food (Example).

  2. If you have the OpenLink Structured Data Sniffer installed (Chrome, Firefox), click the plug-in logo to view a human-friendly rendering of the structured data embedded in the web page. This is the data that will be extracted from each recipe web page.
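For readers without the Sniffer installed, the idea of "structured data embedded in a page" can be sketched in a few lines of Python. This is a minimal illustration only: it assumes the page carries its schema.org metadata as JSON-LD inside `<script type="application/ld+json">` elements (one of several notations the Sniffer understands), and the sample page and recipe name are made up.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.documents.append(json.loads("".join(self._buffer)))


def extract_jsonld(html):
    """Return every JSON-LD document found in the given HTML string."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.documents


# Hypothetical, heavily trimmed page used only for illustration.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Recipe", "name": "Example pancakes"}
</script>
</head><body><h1>Example pancakes</h1></body></html>
"""

recipes = [d for d in extract_jsonld(SAMPLE_HTML) if d.get("@type") == "Recipe"]
print(recipes[0]["name"])
```

The Sponger does considerably more than this (multiple notations, mapping to RDF, storage in a named graph); the sketch only shows what kind of data is sitting in the page.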

The recipes listed under each cuisine collection can be loaded into your Virtuoso instance by running the following SPARQL query in its built-in SPARQL interface:

DEFINE get:soft "replace"
DEFINE input:grab-var "?recipePage"
DEFINE input:grab-depth 1
SELECT ?recipePage
FROM <https://www.bbcgoodfood.com/recipes/collection/american>
FROM <https://www.bbcgoodfood.com/recipes/collection/british>
FROM <https://www.bbcgoodfood.com/recipes/collection/caribbean>
FROM <https://www.bbcgoodfood.com/recipes/collection/chinese>
FROM <https://www.bbcgoodfood.com/recipes/collection/french>
FROM <https://www.bbcgoodfood.com/recipes/collection/greek>
FROM <https://www.bbcgoodfood.com/recipes/collection/indian>
FROM <https://www.bbcgoodfood.com/recipes/collection/italian>
FROM <https://www.bbcgoodfood.com/recipes/collection/japanese>
FROM <https://www.bbcgoodfood.com/recipes/collection/mediterranean>
FROM <https://www.bbcgoodfood.com/recipes/collection/mexican>
FROM <https://www.bbcgoodfood.com/recipes/collection/moroccan>
FROM <https://www.bbcgoodfood.com/recipes/collection/spanish>
FROM <https://www.bbcgoodfood.com/recipes/collection/thai>
FROM <https://www.bbcgoodfood.com/recipes/collection/turkish>
FROM <https://www.bbcgoodfood.com/recipes/collection/vietnamese>
WHERE
{
  ?s sioc:links_to ?recipePage .
  FILTER(CONTAINS(str(?recipePage), "/recipes/")
         && !CONTAINS(str(?recipePage), "/category")
         && !CONTAINS(str(?recipePage), "/collection")
         && isIRI(?recipePage))
}
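Because the query differs between collections only in its FROM clauses, it can be generated rather than typed out. The following is a minimal sketch, not part of the original workflow: the function name `build_crawl_query` and its `spasql` flag are hypothetical, while the pragmas, graph IRIs, and filter are copied from the query above.

```python
# Cuisine slugs taken from the FROM clauses of the query above.
CUISINES = [
    "american", "british", "caribbean", "chinese",
    "french", "greek", "indian", "italian",
    "japanese", "mediterranean", "mexican", "moroccan",
    "spanish", "thai", "turkish", "vietnamese",
]

COLLECTION_BASE = "https://www.bbcgoodfood.com/recipes/collection/"

QUERY_HEAD = [
    'DEFINE get:soft "replace"',
    'DEFINE input:grab-var "?recipePage"',
    "DEFINE input:grab-depth 1",
    "SELECT ?recipePage",
]

QUERY_TAIL = [
    "WHERE",
    "{",
    "?s sioc:links_to ?recipePage.",
    'FILTER(CONTAINS(str(?recipePage),"/recipes/")'
    ' && !CONTAINS(str(?recipePage),"/category")'
    ' && !CONTAINS(str(?recipePage),"/collection")'
    " && isIRI(?recipePage))",
    "}",
]


def build_crawl_query(cuisines, spasql=False):
    """Assemble the crawl query with one FROM clause per cuisine slug.

    With spasql=True the text is prefixed with the SPARQL keyword so it
    can be pasted into a SQL session instead of the SPARQL interface.
    """
    from_clauses = [f"FROM <{COLLECTION_BASE}{slug}>" for slug in cuisines]
    lines = QUERY_HEAD + from_clauses + QUERY_TAIL
    if spasql:
        lines = ["SPARQL"] + lines
    return "\n".join(lines)


# Print a shortened version covering two collections only.
print(build_crawl_query(["thai", "greek"]))
```

Passing the full `CUISINES` list reproduces the sixteen-collection query; adding or dropping a collection then becomes a one-word change.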

If you’re using a SQL environment, the same query can be executed through Virtuoso’s SPARQL-within-SQL (SPASQL) functionality by prefixing it with the SPARQL keyword:

SPARQL
DEFINE get:soft "replace"
DEFINE input:grab-var "?recipePage"
DEFINE input:grab-depth 1
SELECT ?recipePage
FROM <https://www.bbcgoodfood.com/recipes/collection/american>
FROM <https://www.bbcgoodfood.com/recipes/collection/british>
FROM <https://www.bbcgoodfood.com/recipes/collection/caribbean>
FROM <https://www.bbcgoodfood.com/recipes/collection/chinese>
FROM <https://www.bbcgoodfood.com/recipes/collection/french>
FROM <https://www.bbcgoodfood.com/recipes/collection/greek>
FROM <https://www.bbcgoodfood.com/recipes/collection/indian>
FROM <https://www.bbcgoodfood.com/recipes/collection/italian>
FROM <https://www.bbcgoodfood.com/recipes/collection/japanese>
FROM <https://www.bbcgoodfood.com/recipes/collection/mediterranean>
FROM <https://www.bbcgoodfood.com/recipes/collection/mexican>
FROM <https://www.bbcgoodfood.com/recipes/collection/moroccan>
FROM <https://www.bbcgoodfood.com/recipes/collection/spanish>
FROM <https://www.bbcgoodfood.com/recipes/collection/thai>
FROM <https://www.bbcgoodfood.com/recipes/collection/turkish>
FROM <https://www.bbcgoodfood.com/recipes/collection/vietnamese>
WHERE
{
  ?s sioc:links_to ?recipePage .
  FILTER(CONTAINS(str(?recipePage), "/recipes/")
         && !CONTAINS(str(?recipePage), "/category")
         && !CONTAINS(str(?recipePage), "/collection")
         && isIRI(?recipePage))
};

This query extracts the metadata embedded in each collection landing page, then follows the recipe links found on those pages (roughly 300 in total) and extracts the metadata embedded in each recipe page as well.
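The FILTER is what keeps the crawl restricted to recipe pages. Its string checks can be mirrored in Python to see which links would be followed. This is a sketch for illustration only: the function name and the sample URLs are made up, and the `isIRI` test from the query has no counterpart here because every input is already a URL string.

```python
def is_recipe_link(url):
    """Mirror the three CONTAINS checks in the FILTER of the crawl query."""
    return (
        "/recipes/" in url
        and "/category" not in url
        and "/collection" not in url
    )


# Hypothetical URLs used only for illustration.
candidates = [
    "https://www.bbcgoodfood.com/recipes/example-recipe",
    "https://www.bbcgoodfood.com/recipes/collection/thai",
    "https://www.bbcgoodfood.com/recipes/category/example-category",
    "https://www.bbcgoodfood.com/howto/guide/example-guide",
]

for url in candidates:
    print(is_recipe_link(url), url)
```

Note that the collection pages themselves contain "/recipes/" in their path, which is why the explicit "/collection" exclusion is needed.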

Once loaded, the recipes can be viewed by running:

SELECT ?recipe ?name ?source
WHERE
{
  ?recipe a schema:Recipe ;
          schema:name ?name ;
          wdrs:describedby ?source .
  FILTER(CONTAINS(str(?source), "bbcgoodfood"))
}
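The same listing can also be fetched programmatically over the standard SPARQL protocol. The sketch below uses only the Python standard library and makes a few assumptions: the helper names are hypothetical, the endpoint URL is whatever your instance exposes, and the `schema:` and `wdrs:` prefixes are taken to be predeclared on the server, as the query above already relies on.

```python
import json
import urllib.parse
import urllib.request

RECIPE_QUERY = """
SELECT ?recipe ?name ?source
WHERE
{
  ?recipe a schema:Recipe ;
          schema:name ?name ;
          wdrs:describedby ?source .
  FILTER(CONTAINS(str(?source), "bbcgoodfood"))
}
"""


def build_request(endpoint, query):
    """Build a GET request that asks for SPARQL JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def parse_bindings(payload):
    """Flatten a SPARQL JSON result document into a list of plain dicts."""
    doc = json.loads(payload)
    return [
        {var: cell["value"] for var, cell in row.items()}
        for row in doc["results"]["bindings"]
    ]


def run_select(endpoint, query):
    """Send the query to the endpoint and return the parsed rows."""
    with urllib.request.urlopen(build_request(endpoint, query)) as resp:
        return parse_bindings(resp.read().decode("utf-8"))
```

With a local instance, a call might look like `run_select("http://localhost:8890/sparql", RECIPE_QUERY)`, where the host, port, and path are assumptions to adjust for your own setup. Each returned row is a dict with `recipe`, `name`, and `source` keys.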

We can also view a visual representation of the extracted recipe metadata through Virtuoso’s built-in Faceted Browsing interface, by clicking on the result hyperlinks in the recipe column.

Finally, you can also use the same query to generate a PivotViewer Report that adds image processing and animated drill-down to the experience, as illustrated in the screenshots that follow.

Default Page — Live Query
Filter by Recipe Categories (e.g., Lunch and Dinner) — Live Query
Filter by Recipe Calories

Next Steps

Now that the data has been loaded, you can begin querying the newly loaded recipes in depth [Article to be released later this week].

Bonus

I’ll repeat this to extract the entire BBC Good Food recipe set, and share the script if this post receives 100 claps 👏🏾

Related Content

OpenLink Virtuoso Weblog

News & Articles related to OpenLink Virtuoso & Related Technologies

Written by Daniel Heward-Mills

Technical Specialist @ OpenLink Software: https://www.linkedin.com/in/daniel-heward-mills-a0940465/
