What’s cooking? Part 1: Importing BBC goodfood information into Neo4j

Introduction

My colleague Mark Needham and I were very keen to get our hands on a new data set: BBC goodfood recipes. Each recipe carries a wealth of interesting data: ingredients, diet types, nutrition, and so forth. This gives us plenty of opportunities to try out great graph database use-cases in a fun setting. Before we get started on these use-cases, we need to get that data. In this post we will show you a couple of ways to get that data into Neo4j.

Overview of the situation

Graph databases are a fantastic fit for many business use-cases. We suspect many of you will have come across some of them, including:

  • Recommendation engines
  • Master Data Management
  • Entity resolution
  • Fraud detection
  • And many more!

Graph databases are a great fit because relationships between data entities are valued the same as the data entities themselves. Not only do you look at the data entity itself, but also its context in relation to other entities. There is a wealth of material on this online. For those of you who are interested in learning more, look here.

So we have a great technology and we have some neat examples we want to apply. All we need is that tasty data set to work with. There is a great set of examples ready for you to get your hands on via the Neo4j GraphGists. However, for this post we’ve decided to do something a little different…

BBC goodfood has a fantastic selection of recipes. It has long been an extremely popular resource for budding chefs, and for those looking for something different at meal times. There is an extensive selection covering different courses, cuisines and dietary requirements.

BBC goodfood seriously rich chocolate cake

Examining one of the recipes (who doesn’t fancy that seriously rich chocolate cake?), we can see we’ve got a lot of useful information such as ingredients, nutrition, user ratings, and so forth.

Investigating the source code for the page, we discover we have a very tidy object that gives us helpful summary information:

{page:{article:{author:"Good Food",description:"Dark, rich and delicious - the perfect dessert",id:"97123",tags:[]},recipe:{collections:["Chocolate cake","Boozy bake"],cooking_time:2100,prep_time:1800,serves:10,keywords:["Cocoa powder","Dark chocolate","Dessert","Decadent","Pudding","Afternoon tea","Booze","Alcohol","Ground almond","Ground almonds","Kirsch"],ratings:81,nutrition_info:["Added sugar 22g","Carbohydrate 24g","Kcal 401 calories","Protein 10g","Salt 0.66g","Saturated fat 11g","Fat 30g"],ingredients:["butter","flour","dark chocolate","egg","ground almond","kirsch","salt","caster sugar","cocoa powder"],courses:["Dessert","Treat","Buffet"],cusine:"British",diet_types:["Low-salt"],skill_level:"More effort",post_dates:"1009843200"},channel:"Recipe",title:"Seriously rich chocolate cake"}}

Unfurled, it looks something like this:

{page:{  
article:{
author:"Good Food",
description:"Dark, rich and delicious - the perfect dessert",
id:"97123",
tags:[
]
},
recipe:{
collections:[
"Chocolate cake",
"Boozy bake"
],
cooking_time:2100,
prep_time:1800,
serves:10,
keywords:[
"Cocoa powder",
"Dark chocolate",
"Dessert",
"Decadent",
"Pudding",
"Afternoon tea",
"Booze",
"Alcohol",
"Ground almond",
"Ground almonds",
"Kirsch"
],
ratings:81,
nutrition_info:[
"Added sugar 22g",
"Carbohydrate 24g",
"Kcal 401 calories",
"Protein 10g",
"Salt 0.66g",
"Saturated fat 11g",
"Fat 30g"
],
ingredients:[
"butter",
"flour",
"dark chocolate",
"egg",
"ground almond",
"kirsch",
"salt",
"caster sugar",
"cocoa powder"
],
courses:[
"Dessert",
"Treat",
"Buffet"
],
cusine:"British",
diet_types:[
"Low-salt"
],
skill_level:"More effort",
post_dates:"1009843200"
},
channel:"Recipe",
title:"Seriously rich chocolate cake"
}}
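The object above uses unquoted keys, so a strict JSON parser will reject it as it stands. As a rough sketch of the parsing step (the regular expression and the `parse_page_object` helper are our own illustration, and the approach is brittle: it assumes no string value contains a comma followed by a word and a colon), one way to turn it into a Python dictionary might be:

```python
import json
import re

# Matches an unquoted key that directly follows "{" or ",".
# Assumption: string values never contain ", word:" style text.
KEY_PATTERN = re.compile(r'([{,])\s*([A-Za-z_][A-Za-z0-9_]*)\s*:')

def parse_page_object(text):
    """Quote the bare keys of the page object, then parse it as JSON."""
    quoted = KEY_PATTERN.sub(r'\1"\2":', text)
    return json.loads(quoted)

# A shortened version of the chocolate cake object shown above.
snippet = (
    '{page:{article:{author:"Good Food",'
    'description:"Dark, rich and delicious - the perfect dessert",'
    'id:"97123",tags:[]},'
    'recipe:{serves:10,ingredients:["butter","flour"]},'
    'title:"Seriously rich chocolate cake"}}'
)

page_object = parse_page_object(snippet)
print(page_object["page"]["title"])
```

If the objects you have collected are already valid JSON, `json.loads` alone is enough and the key-quoting step can be dropped.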

This would be a worthwhile data set to import. We can ask some very interesting questions, such as:

  • What are the most commonly used ingredients?
  • Recommend me recipes based on author/ingredient/diet
  • I have these ingredients, what can I cook?

Now we’ve downloaded some recipes we want to investigate (the onus is on the reader to do this step, including parsing). How do we get this data into Neo4j?

There are a number of mechanisms that allow us to get data into Neo4j. We’re going to explore two approaches below.

First approach — LOAD CSV

To start off with, we may just want to explore the ingredients for the recipes. That is, we extract each recipe’s unique ID, title, and ingredients list. A quick and easy way to get the data into Neo4j is LOAD CSV (CSV stands for comma-separated values). Our graph data model is simple: Recipe nodes connected to Ingredient nodes by CONTAINS_INGREDIENT relationships.

To get the recipe data into a digestible format, we’d suggest one row per ingredient, with the columns id, title, ingredient. A snippet would look as follows:

"97123","Seriously rich chocolate cake","butter"
"97123","Seriously rich chocolate cake","flour"
"97123","Seriously rich chocolate cake","dark chocolate"
"97123","Seriously rich chocolate cake","egg"
"97123","Seriously rich chocolate cake","ground almond"
"97123","Seriously rich chocolate cake","kirsch"
"97123","Seriously rich chocolate cake","salt"
"97123","Seriously rich chocolate cake","caster sugar"
"97123","Seriously rich chocolate cake","cocoa powder"

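If your page objects are already parsed, producing that flat file takes only a few lines. Here is a minimal sketch in Python; the helper names `recipe_rows` and `write_csv` are ours, and it assumes each object has the page/article/recipe shape shown earlier:

```python
import csv
import io

def recipe_rows(page_object):
    """Yield one (id, title, ingredient) tuple per ingredient."""
    page = page_object["page"]
    recipe_id = page["article"]["id"]
    title = page["title"]
    for ingredient in page["recipe"]["ingredients"]:
        yield (recipe_id, title, ingredient)

def write_csv(page_objects, handle):
    """Write every recipe as quoted id,title,ingredient rows."""
    writer = csv.writer(handle, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for page_object in page_objects:
        writer.writerows(recipe_rows(page_object))

# A cut-down version of the chocolate cake object, for illustration.
example = {
    "page": {
        "article": {"id": "97123"},
        "recipe": {"ingredients": ["butter", "flour", "dark chocolate"]},
        "title": "Seriously rich chocolate cake",
    }
}

buffer = io.StringIO()
write_csv([example], buffer)
csv_text = buffer.getvalue()
print(csv_text, end="")
```

To write a real file instead of an in-memory buffer, pass a handle opened with `open("bbcgoodfood.csv", "w", newline="")`.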
Once we’ve got all of the recipe data into this flat file format, we can load it into Neo4j using Cypher queries. To run the statements below in Neo4j Browser, you will need to paste each semicolon-terminated statement separately. Alternatively, you can enable multi-statement mode (in Browser, go to Settings and tick ‘Enable multi statement query editor’). Save your file into the import folder, and run the following:

CREATE INDEX ON :Ingredient(value);
CREATE INDEX ON :Recipe(id);
LOAD CSV FROM "file:///bbcgoodfood.csv" AS line
MERGE (r:Recipe {id: line[0]})
ON CREATE SET r.title = line[1]
MERGE (i:Ingredient {value: line[2]})
CREATE (r)-[:CONTAINS_INGREDIENT]->(i);

If you’ve got a particularly large file, you may wish to consider periodic commit (USING PERIODIC COMMIT), which commits the import in batches rather than in one large transaction.

With some data now in, we can ask some simple questions, such as, what are the most common ingredients?

MATCH (i:Ingredient)<-[rel:CONTAINS_INGREDIENT]-(r:Recipe)
RETURN i.value, count(rel) AS recipes ORDER BY recipes DESC

Or, how about suggesting some other recipes that are similar to that chocolate cake? Here’s a simple query that recommends the recipes sharing the most ingredients with it:

MATCH (r:Recipe {id:'97123'})-[:CONTAINS_INGREDIENT]->(i:Ingredient)<-[:CONTAINS_INGREDIENT]-(rec:Recipe)
WITH rec, COUNT(*) AS commonIngredients
RETURN rec.title, rec.id ORDER BY commonIngredients DESC LIMIT 10
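To make the counting in that recommendation query explicit, here is the same idea sketched in plain Python on a tiny in-memory dictionary. The `similar_recipes` helper and the three "demo" recipes are made up purely for illustration; only the chocolate cake’s ingredient list comes from the page object above.

```python
from collections import Counter

def similar_recipes(recipes, target_id, limit=10):
    """Rank other recipes by the number of ingredients shared with target_id."""
    target = set(recipes[target_id])
    shared = Counter()
    for recipe_id, ingredients in recipes.items():
        if recipe_id == target_id:
            continue  # never recommend the recipe we started from
        common = len(target & set(ingredients))
        if common:
            shared[recipe_id] = common
    return shared.most_common(limit)

recipes = {
    "97123": ["butter", "flour", "dark chocolate", "egg", "ground almond",
              "kirsch", "salt", "caster sugar", "cocoa powder"],
    # Hypothetical recipes, invented for this example only.
    "demo-1": ["butter", "flour", "egg", "milk"],
    "demo-2": ["salt", "tomato"],
    "demo-3": ["rice"],
}

print(similar_recipes(recipes, "97123"))
# → [('demo-1', 3), ('demo-2', 1)]
```

The graph query does the same work by walking from the cake to its ingredients and back out to other recipes, so recipes with nothing in common are never visited at all.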

This is a great start and there are lots of things to explore with this data. However, it would be nice to get the rest of that page object. Also, rather than having to create a verbose CSV file, wouldn’t it be better if we could use that object as is?

Second approach — load.json

This is where the APOC library comes in. APOC (A Package Of Components) is a set of user-defined procedures and functions, packaged together, that can be called from Cypher queries. To install APOC on your Neo4j database instance, check here.

We’re going to use the apoc.load.json procedure to ingest all of those page objects. Once you have extracted all of the objects into a file, you can run the following query (again, multiple statements). If you wish, you may use our import file. You will notice we load the file several times; this is to reduce the eagerness of the data import:

CREATE INDEX ON :Recipe(id);
CREATE INDEX ON :Ingredient(name);
CREATE INDEX ON :Keyword(name);
CREATE INDEX ON :DietType(name);
CREATE INDEX ON :Author(name);
CREATE INDEX ON :Collection(name);
:params jsonFile => "https://raw.githubusercontent.com/mneedham/bbcgoodfood/master/stream_all.json";
CALL apoc.load.json($jsonFile) YIELD value
WITH value.page.article.id AS id,
value.page.title AS title,
value.page.article.description AS description,
value.page.recipe.cooking_time AS cookingTime,
value.page.recipe.prep_time AS preparationTime,
value.page.recipe.skill_level AS skillLevel
MERGE (r:Recipe {id: id})
SET r.cookingTime = cookingTime,
r.preparationTime = preparationTime,
r.name = title,
r.description = description,
r.skillLevel = skillLevel;
CALL apoc.load.json($jsonFile) YIELD value
WITH value.page.article.id AS id,
value.page.article.author AS author
MERGE (a:Author {name: author})
WITH a,id
MATCH (r:Recipe {id:id})
MERGE (a)-[:WROTE]->(r);
CALL apoc.load.json($jsonFile) YIELD value
WITH value.page.article.id AS id,
value.page.recipe.ingredients AS ingredients
MATCH (r:Recipe {id:id})
FOREACH (ingredient IN ingredients |
MERGE (i:Ingredient {name: ingredient})
MERGE (r)-[:CONTAINS_INGREDIENT]->(i)
);
CALL apoc.load.json($jsonFile) YIELD value
WITH value.page.article.id AS id,
value.page.recipe.keywords AS keywords
MATCH (r:Recipe {id:id})
FOREACH (keyword IN keywords |
MERGE (k:Keyword {name: keyword})
MERGE (r)-[:KEYWORD]->(k)
);
CALL apoc.load.json($jsonFile) YIELD value
WITH value.page.article.id AS id,
value.page.recipe.diet_types AS dietTypes
MATCH (r:Recipe {id:id})
FOREACH (dietType IN dietTypes |
MERGE (d:DietType {name: dietType})
MERGE (r)-[:DIET_TYPE]->(d)
);
CALL apoc.load.json($jsonFile) YIELD value
WITH value.page.article.id AS id,
value.page.recipe.collections AS collections
MATCH (r:Recipe {id:id})
FOREACH (collection IN collections |
MERGE (c:Collection {name: collection})
MERGE (r)-[:COLLECTION]->(c)
);

You may wish to download the file and load it locally. Note that you will need to change the path to an absolute path to the file on your machine, e.g. file:///<absolutepath>/stream_all.json. This is an absolute path, not a relative one as is the case with LOAD CSV.

:params jsonFile => "file:///<absolutepath>/stream_all.json";

You’ll also need to add the following line to your Neo4j configuration file:

apoc.import.file.enabled=true
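If you are unsure what the absolute file URL for your local copy should look like, Python’s standard library can build it for you. This is a small sketch; the `to_file_url` helper is our own name, and it assumes stream_all.json sits in the directory you run the script from:

```python
from pathlib import Path

def to_file_url(path):
    """Return an absolute file:// URL for a local path."""
    # resolve() makes the path absolute; as_uri() requires an absolute path.
    return Path(path).resolve().as_uri()

json_file = to_file_url("stream_all.json")

# Print a line that can be pasted into Neo4j Browser.
print(f':params jsonFile => "{json_file}";')
```

The file does not need to exist for the URL to be built, so double-check the printed path before running the import.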

And there we have it, a rich and interesting data set to explore within a graph database. Going back to the chocolate cake, what else has the author published?

MATCH (rec:Recipe)<-[:WROTE]-(a:Author)-[:WROTE]->(r:Recipe {id:'97123'})
RETURN rec.name, rec.id

Summary

We’ve shown the principles of loading recipe data into Neo4j, using either CSVs, or load.json. There’s even a simple recommendation query to get you started with the recipes data.

This rich data set gives us even more things to explore. For example, approaches on entity resolution: does cherry tomato belong with tomatoes, or with cherries? Or, can we start creating common recipe communities, such as those for cakes? Some food for thought indeed!