Loading BBC Good Food Data into Virtuoso
Overview
Virtuoso’s Sponger Middleware layer generates Linked Data from a variety of disparate web content (HTML, iCal, RDF Documents and statements, etc.).
Objective
This article demonstrates how Virtuoso may be used to extract, transform, and load (ETL) recipe data from BBC Good Food.
The results of this tutorial can be seen by clicking the live links provided throughout.
Requirements
- Virtuoso Universal Server (either Commercial/Enterprise or Open Source)
- Linked Data Cartridges VAD (Commercial/Enterprise or Open Source)
- Optional: OpenLink Structured Data Sniffer browser extension (Chrome, Opera, Firefox)
Observing Linked Data Embedded in Recipe Pages
[1] Open a recipe web page from BBC Good Food (example)
[2] If you have installed OSDS, the OpenLink Structured Data Sniffer, in your browser (Chrome, Opera, Firefox), just click on the Sniffer logo in your browser’s toolbar to view a human-friendly version of structured data embedded within the web page. Similar data will be extracted from each recipe web page.
[3] Results for each category can be loaded into your Virtuoso instance by executing the following SPARQL query through the built-in SPARQL interface at <http://{cname[:port]}/sparql>:
DEFINE get:soft "replace"
DEFINE input:grab-var "?recipePage"
DEFINE input:grab-depth 1
SELECT ?recipePage
FROM <https://www.bbcgoodfood.com/recipes/collection/american>
FROM <https://www.bbcgoodfood.com/recipes/collection/british>
FROM <https://www.bbcgoodfood.com/recipes/collection/caribbean>
FROM <https://www.bbcgoodfood.com/recipes/collection/chinese>
FROM <https://www.bbcgoodfood.com/recipes/collection/french>
FROM <https://www.bbcgoodfood.com/recipes/collection/greek>
FROM <https://www.bbcgoodfood.com/recipes/collection/indian>
FROM <https://www.bbcgoodfood.com/recipes/collection/italian>
FROM <https://www.bbcgoodfood.com/recipes/collection/japanese>
FROM <https://www.bbcgoodfood.com/recipes/collection/mediterranean>
FROM <https://www.bbcgoodfood.com/recipes/collection/mexican>
FROM <https://www.bbcgoodfood.com/recipes/collection/moroccan>
FROM <https://www.bbcgoodfood.com/recipes/collection/spanish>
FROM <https://www.bbcgoodfood.com/recipes/collection/thai>
FROM <https://www.bbcgoodfood.com/recipes/collection/turkish>
FROM <https://www.bbcgoodfood.com/recipes/collection/vietnamese>
WHERE
  {
    ?s sioc:links_to ?recipePage .
    FILTER
      ( CONTAINS(str(?recipePage), "/recipes/")
        && !CONTAINS(str(?recipePage), "/category")
        && !CONTAINS(str(?recipePage), "/collection")
        && isIRI(?recipePage)
      )
  }
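Because the FROM clauses in the query above differ only in the collection name, the query text can be generated rather than typed by hand. The following is a minimal Python sketch and not part of the original tutorial; the names `BASE`, `COLLECTIONS`, and `build_crawl_query` are illustrative assumptions, and the function only produces a string to paste into the SPARQL interface.

```python
# Illustrative helper (not from the original article): assemble the crawl
# query from a list of BBC Good Food collection names.
BASE = "https://www.bbcgoodfood.com/recipes/collection/"

COLLECTIONS = [
    "american", "british", "caribbean", "chinese", "french", "greek",
    "indian", "italian", "japanese", "mediterranean", "mexican",
    "moroccan", "spanish", "thai", "turkish", "vietnamese",
]


def build_crawl_query(collections=COLLECTIONS):
    """Return the Sponger crawl query as a single string."""
    lines = [
        'DEFINE get:soft "replace"',
        'DEFINE input:grab-var "?recipePage"',
        "DEFINE input:grab-depth 1",
        "SELECT ?recipePage",
    ]
    # One FROM clause per collection landing page.
    lines += [f"FROM <{BASE}{name}>" for name in collections]
    lines += [
        "WHERE",
        "  {",
        "    ?s sioc:links_to ?recipePage .",
        "    FILTER",
        '      ( CONTAINS(str(?recipePage), "/recipes/")',
        '        && !CONTAINS(str(?recipePage), "/category")',
        '        && !CONTAINS(str(?recipePage), "/collection")',
        "        && isIRI(?recipePage)",
        "      )",
        "  }",
    ]
    return "\n".join(lines)
```

Calling `print(build_crawl_query())` emits the full query; passing a shorter list, such as `["thai", "greek"]`, limits the crawl to those collections.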
If you’re using a SQL environment (isql, ODBC, JDBC, etc.), the same query can be executed using Virtuoso’s SPARQL-Within-SQL (SPASQL) functionality simply by (1) adding the keyword SPARQL to the start of the query, and (2) adding a semicolon to the end:
SPARQL
DEFINE get:soft "replace"
DEFINE input:grab-var "?recipePage"
DEFINE input:grab-depth 1
SELECT ?recipePage
FROM <https://www.bbcgoodfood.com/recipes/collection/american>
FROM <https://www.bbcgoodfood.com/recipes/collection/british>
FROM <https://www.bbcgoodfood.com/recipes/collection/caribbean>
FROM <https://www.bbcgoodfood.com/recipes/collection/chinese>
FROM <https://www.bbcgoodfood.com/recipes/collection/french>
FROM <https://www.bbcgoodfood.com/recipes/collection/greek>
FROM <https://www.bbcgoodfood.com/recipes/collection/indian>
FROM <https://www.bbcgoodfood.com/recipes/collection/italian>
FROM <https://www.bbcgoodfood.com/recipes/collection/japanese>
FROM <https://www.bbcgoodfood.com/recipes/collection/mediterranean>
FROM <https://www.bbcgoodfood.com/recipes/collection/mexican>
FROM <https://www.bbcgoodfood.com/recipes/collection/moroccan>
FROM <https://www.bbcgoodfood.com/recipes/collection/spanish>
FROM <https://www.bbcgoodfood.com/recipes/collection/thai>
FROM <https://www.bbcgoodfood.com/recipes/collection/turkish>
FROM <https://www.bbcgoodfood.com/recipes/collection/vietnamese>
WHERE
  {
    ?s sioc:links_to ?recipePage .
    FILTER
      ( CONTAINS(str(?recipePage), "/recipes/")
        && !CONTAINS(str(?recipePage), "/category")
        && !CONTAINS(str(?recipePage), "/collection")
        && isIRI(?recipePage)
      )
  } ;
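The two-step conversion described above (prefix the keyword SPARQL, append a semicolon) is mechanical, so it can be scripted. This is a small Python sketch under that assumption; `to_spasql` is a hypothetical helper name, and it only rewrites text without connecting to Virtuoso.

```python
def to_spasql(sparql_query: str) -> str:
    """Wrap a SPARQL query for SQL clients: prefix SPARQL, append ';'."""
    body = sparql_query.strip()
    # Avoid doubling the terminator if the caller already added one.
    if body.endswith(";"):
        body = body[:-1].rstrip()
    return "SPARQL\n" + body + " ;"
```

The returned string can then be pasted into isql or passed to an ODBC or JDBC client as an ordinary SQL statement.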
[4] Either SQL or SPARQL execution of the query will extract the embedded metadata from each category landing page, and from the recipes that follow (roughly 300 in total).
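To see which linked pages the query's FILTER lets through, its string tests can be mirrored in a few lines of Python. This is only an approximation for illustration: `is_recipe_page` is a hypothetical name, the URLs in the usage note are made-up examples, and the `isIRI` check has no plain-string equivalent, so it is omitted.

```python
def is_recipe_page(url: str) -> bool:
    """Approximate the FILTER: keep recipe pages, skip category/collection pages."""
    return (
        "/recipes/" in url
        and "/category" not in url
        and "/collection" not in url
    )
```

Under this rule a hypothetical `https://www.bbcgoodfood.com/recipes/example-recipe` would be followed, while a collection landing page such as `https://www.bbcgoodfood.com/recipes/collection/thai` would not be followed as a recipe.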
[5] Once loaded, the recipes can be viewed through a SPARQL query like this:
SELECT ?recipe
?name
?source
WHERE
{
?recipe a schema:Recipe ;
schema:name ?name ;
wdrs:describedby ?source .
FILTER
(
CONTAINS(str(?source),"bbcgoodfood")
)
}
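The same listing query can also be sent to the endpoint from a script instead of the browser form. Below is a hedged Python sketch using only the standard library and the standard SPARQL Protocol (`query` parameter, JSON results via the `Accept` header). The function names are illustrative, and the endpoint address in the usage note is an assumption; substitute your own `http://{cname[:port]}/sparql`.

```python
import json
import urllib.parse
import urllib.request

# The listing query from step [5], kept as a single string.
RECIPE_QUERY = """
SELECT ?recipe ?name ?source
WHERE
  {
    ?recipe a schema:Recipe ;
            schema:name ?name ;
            wdrs:describedby ?source .
    FILTER ( CONTAINS(str(?source), "bbcgoodfood") )
  }
"""


def build_request(endpoint: str, query: str) -> urllib.request.Request:
    """Build a SPARQL Protocol GET request that asks for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def rows_from_results(payload: dict) -> list:
    """Flatten SPARQL JSON results into a list of {variable: value} dicts."""
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in payload["results"]["bindings"]
    ]


def fetch_recipes(endpoint: str) -> list:
    """Run RECIPE_QUERY against the endpoint (requires network access)."""
    with urllib.request.urlopen(build_request(endpoint, RECIPE_QUERY)) as resp:
        return rows_from_results(json.load(resp))
```

For a local instance, `fetch_recipes("http://localhost:8890/sparql")` would return one dictionary per recipe, assuming the server listens on that address and the recipes have already been loaded.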
[6] We can also view a visual representation of the extracted recipe metadata through Virtuoso’s built-in Faceted Browsing interface, by clicking on the hyperlinks in the recipe column of the SPARQL results page.
[7] Finally, you can use the same query to generate a PivotViewer Report that adds image processing and an animated drill-down to the experience, as illustrated in the screenshots that follow:
Next Steps
[8] Now that the data has been loaded, you can begin querying the newly loaded recipes in depth [Article to be released later this week].
Bonus
I’ll repeat this to extract the entire BBC Good Food recipe set, and share the script if this post receives 100 claps 👏🏾
Related Content
- Things You Can’t do Without Virtuoso
- Conceptual Relational Data Virtualization, using Existing Open Standards
- Reducing Multiple CSV Documents to a Table in Virtuoso
- OpenLink ODBC Drivers Home Page
- Virtuoso Home Page
- Free Evaluation and Download Page — for Windows, Linux, and macOS
- Free Evaluation License for Windows
- Free Evaluation License for Linux
- Free Evaluation License for macOS
- Current Entry-Level Offers across Linux, Windows, and macOS
- Virtuoso Pay-As-You-Go (PAGO) Edition from Amazon Web Services (AWS) Cloud