Ljubica Lazarevic
Jun 25

You can read the previous parts of the series to get up to date:

Part 1: Importing BBC goodfood information into Neo4j

Part 2: What can I make with these ingredients?

Part 3: A segue into graph modelling

Part 4: Similarities

Introduction

Welcome back, dear foodies! We’ve done some interesting work on our BBC goodfood journey:

  • We’ve explored the BBC goodfood data set and thought about how we might model and import it
  • We’ve thought about how we can start to query that data set, leveraging relationships between ingredients and recipes to find potential recommendations
  • We’ve started looking at similarities between recipes based on common ingredients

However, we can avoid it no longer! We have some data quality issues we need to address, namely duplicate ingredients.

In this post we’re going to:

  • Look at the kinds of challenges we face when dealing with duplicates
  • Investigate some approaches we can use to identify the duplicate ingredients
  • Introduce potential approaches that might be useful when thinking about entity resolution

This is by no means a set of exhaustive approaches. There will undoubtedly be more and different ways to attack this challenge. However, you will be armed with some ideas!

Overview of the situation

Let’s have a look at an example. I fancy some almonds about now, so why not investigate that:

//All the ingredients that relate to almonds
MATCH (i:Ingredient)
WHERE i.name CONTAINS 'almond'
RETURN i.name AS Almonds ORDER BY Almonds

Which returns the following:

Now whilst there’s a big difference between almond butter and almond milk, we do have a situation where we’re dealing with:

  • Plurals: e.g. almond and almonds
  • Active versus passive forms: e.g. flaked almond and almond flakes
  • Additional words: e.g. blanched almond and whole blanched almonds

And I suspect there will be some more hurdles we’ll need to navigate down the road!

As a check point, let’s see how many total ingredients we currently have, duplicates and all. Running:

MATCH (n:Ingredient)
RETURN COUNT(*)

tells us we have 3077 ingredients. Let’s see what we’re left with after a bit of work.

Duplicate resolution versus entity resolution

The other interesting aspect we’re going to touch on here is entity resolution. Whilst we’ll cover this in more depth in a later post, we will have some opportunities in this current journey to address some obvious candidates. So what might we mean by this?


For example, I think we can all comfortably agree that almond and almonds are probably one and the same, and it’s a duplicate of sorts. And we’d also say the same around cherry tomato and cherry tomatoes.

But let’s look more closely at cherry tomato… is it a cherry, or is it a tomato? Ah — now we step into the interesting world of entity resolution. There may be many reasons why we may wish to determine whether cherry tomato is a tomato or a cherry (or just embrace it as an entity in its own right!):

  • We’re thinking about replacement ingredients — if we don’t have any cherry tomatoes, but we happen to have some vine tomatoes in the pantry, we don’t want to give up on that lunch we’d set our hearts on
  • We’re thinking about better understanding recipe similarity (such as Mark’s post on similarity) — we’ll better understand any recipes lurking under different names
  • We want to build a taxonomy of ingredients — vine tomato, cherry tomato, plum tomato — they’re all tomatoes

There will, undoubtedly, be much conversation around what entity resolution is and isn’t:

  • one viewpoint may well be that treating almond and almonds as the same thing is entity resolution (and not de-duplication), and that cherry tomato is not a vine tomato
  • another viewpoint may well be that all tomatoes are just tomatoes.

Whilst the view of how we deal with cherry tomato will be a topic for a future post, defining the specifics will inevitably come down to the business rules.

Getting started — it’s all about the business rules…

To be able to think about how we’re going to resolve the duplicates we have, we have to make a number of decisions around how we classify why two entities are duplicates of each other. Some of these will be obvious data quality challenges, for example erroneous special characters, or extra padding provided by white spaces. Some are slightly less obvious, such as plural versus singular, or passive versus active tense. Some will be hard to spot and will rely on context, e.g. Is a Cherry Tomato a type of Cherry or a type of Tomato?

The decisions we make will be based on our knowledge of the domain we are working in. There will be other domains where we cannot make assumptions about plural words, for example! The set of decisions and requirements we pull together to determine how we handle duplicates is commonly known as business rules — i.e. the business decisions being made on how technology will be applied.

So, let’s go about defining some business rules. For now, we’re going to leave the Cherry Tomato dilemma — we’ll have a look at some approaches we can use in the next post. Having a look at the data, here are some examples we’re defining as the rules to clean the data:

  • Plurals refer to the same thing: e.g. tomato, tomatoes
  • The different plural forms: e.g. cherry/cherries, tomato/tomatoes, onion/onions
  • Dealing with active versus passive: e.g. flaked almonds, almond flakes
  • Dealing with capitalisation issues — we will lower-case everything
  • Dealing with interesting spellings: we’ll try and match things together that ‘sound’ the same, e.g. herbs and herbes
  • Dealing with punctuation: we’ll treat hyphens as spaces
  • Dealing with stop words: we’ll assume they’re not important for the ingredient name

The other rule we’re going to apply is that we don’t care about word ordering in ingredient names, e.g. almond flakes and flaked almonds, or light brown soft sugar and light soft brown sugar.
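
We won’t actually need to sort anything (the tokenised graph we build below handles ordering for us), but just to illustrate the idea, here’s a small sketch, assuming APOC is installed, showing that sorting tokens makes word order irrelevant:

// Illustration only: sort a name's tokens so word order no longer matters
WITH 'light soft brown sugar' AS name
UNWIND split(name, ' ') AS token
WITH token ORDER BY token
RETURN apoc.text.join(collect(token), ' ') AS normalised;

Both light brown soft sugar and light soft brown sugar normalise to brown light soft sugar.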

Whilst this is not an exhaustive list of business rules, it should help us think about some of the approaches we can use to automate de-duplicating our ingredients.

Dealing with duplicates — tokenising the ingredients

To make our lives easier, it would make sense to tokenise the ingredient words, for example, split out cherry tomato into two words. We’ll create nodes for this, and we’ll have something looking like this:

This makes it significantly easier to deal with the various aberrations we described above. ‘How so?’ I hear you ask! Tokenising the ingredient allows us to:

  • Forget about word ordering: ‘ginger and garlic paste’ and ‘garlic and ginger paste’ can now be treated the same
  • Tackle plural words that may appear in different parts of the Ingredient name
  • Make it generally easier to de-duplicate Ingredients by de-duplicating constituent components

So let’s get started! First of all, we have some weird character import going on (due to improper encoding), so let’s fix that:

MATCH (i:Ingredient)
WHERE i.name CONTAINS "’" OR i.name CONTAINS "‘" OR i.name CONTAINS "\n"
SET i.name = replace(replace(replace(i.name, "’", "'"), "‘", "'"), "\n", "");

We also have some errant characters, so we’ll remove those too:

MATCH (i:Ingredient)
WHERE i.name CONTAINS('(') OR i.name CONTAINS(')')
SET i.name = replace(replace(i.name, "(", ""), ")", "");

Now we’ve cleaned up those initial data issues it’s time to tokenise. The code to do this:

MATCH (i:Ingredient)
WITH i, split(lower(i.name), ' ') AS names
FOREACH (n IN names|
MERGE (in:IngredientName {name:n})
MERGE (in)-[:IS_COMPONENT_OF]->(i)
);

What this will do is take each ingredient, split the name by space, create a new node for each component, and connect it back to the original Ingredient node. Note that we’re using MERGE, which means we’ll reuse any existing components. Also, we’re using lower() to remove any issues around case.
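
As a quick spot check (a sketch, using the almond ingredients from earlier), we can list the tokens now attached to a few ingredients:

// List the tokens attached to a few almond ingredients
MATCH (in:IngredientName)-[:IS_COMPONENT_OF]->(i:Ingredient)
WHERE i.name CONTAINS 'almond'
RETURN i.name AS ingredient, collect(in.name) AS tokens
ORDER BY ingredient LIMIT 5;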

We also have some hyphenated strings, so let’s deal with those too:

MATCH (i:IngredientName)-[:IS_COMPONENT_OF]->(in)
WHERE i.name CONTAINS '-'
WITH i, in, split(i.name, '-') AS names
FOREACH (n IN names|
MERGE (i2:IngredientName {name:n})
MERGE (i2)-[:IS_COMPONENT_OF]->(in)
)
DETACH DELETE i;

There’s a slight difference from the previous query: here we want to delete the hyphenated entity once we’ve processed it (which we do with DETACH DELETE i).

So that’s the basic tokenisation done. Now we’re going to do some processing on IngredientName to help us with our de-duplicating efforts. It’s worth mentioning there are many approaches we could use to do the cleaning we’re about to do: you may be using an ETL tool, or programmatically cleaning the data. As we’re doing this work in Neo4j, I’ve decided that’s where I’ll do the cleaning.

Firstly, let’s get rid of some stop words, such as in, and, etc. We’re not going to be very precise with this: we’ll assume anything that is 2 characters or fewer in length is a stop word:

MATCH (i:IngredientName) 
WHERE length(i.name) <3
DETACH DELETE i;

And ditch some other ones, such as and, with, etc.:

MATCH (i:IngredientName) 
WHERE i.name IN ['and', 'the', 'this', 'with']
DETACH DELETE i;

Now we’re going to deal with plurals. We are assuming those are going to be +s, +es, +oes. We’ll cover +ies shortly:

MATCH (i1:IngredientName), (i2:IngredientName)
WHERE id(i1)<>id(i2)
AND (i1.name+'s' = i2.name OR
i1.name+'es'=i2.name OR
i1.name+'oes'=i2.name)
WITH i1, i2
MATCH (i1)-[:IS_COMPONENT_OF]->(in1:Ingredient),
(i2)-[:IS_COMPONENT_OF]->(in2:Ingredient)
MERGE (i1)-[:IS_COMPONENT_OF]->(in2)
DETACH DELETE i2;

Something to bear in mind — this is not a particularly graphy query to begin with. It’s essentially a Cartesian query, so at some point there may well be too much data to compare against, and we would have to think about how to do this in a different way. The WHERE id(i1)<>id(i2) part stops us from comparing the same node against itself.
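
If you want a feel for the cost before running such a comparison, a quick sketch like this counts the candidate pairs (using id(i1) < id(i2) so each pair is only counted once):

// Gauge how many pairs a Cartesian comparison would generate
MATCH (i1:IngredientName), (i2:IngredientName)
WHERE id(i1) < id(i2)
RETURN count(*) AS candidatePairs;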

Now we need to do a little more ‘heavy’ lifting: we’re going to do a bit of string manipulation to tackle +ies plurals, and active versus passive forms. This is quite expensive, and the above approach may cause us to run into problems. To handle this we’re going to reduce the initial data set that we want to compare. For this I’ve decided to use Sørensen–Dice similarity.

Sørensen–Dice similarity is an approach to determine how similar two samples are: for two sets X and Y, the score is 2|X ∩ Y| / (|X| + |Y|), ranging from 0 (no overlap) to 1 (identical). For strings, the samples compared are typically the sets of character bigrams in each word.
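
To get a feel for the scores, we can spot-check the APOC implementation on a couple of sample strings (the exact values depend on the implementation, but similar spellings should score noticeably higher):

// Spot-check: similar spellings score high, unrelated words score low
RETURN apoc.text.sorensenDiceSimilarity('cherry', 'cherries') AS similarSpellings,
apoc.text.sorensenDiceSimilarity('almond', 'tomato') AS unrelatedWords;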

Here we use the APOC implementation as a way of fuzzy-matching text:

MATCH (n1:IngredientName),(n2:IngredientName)
WHERE id(n1) <> id(n2)
WITH n1, n2,
apoc.text.sorensenDiceSimilarity(n1.name,n2.name) as sorensenDS
WHERE sorensenDS > 0.6 AND left(n1.name,2)=left(n2.name,2)
WITH n1, n2
WHERE length(n1.name) <> length(n2.name)
AND (left(n1.name, length(n1.name)-1)+'ies' = n2.name OR
n1.name+'d' = n2.name OR
left(n1.name, length(n1.name)-1)+'d' = n2.name)
WITH n1, n2
MATCH (n2)-[:IS_COMPONENT_OF]->(i)
MERGE (n1)-[:IS_COMPONENT_OF]->(i)
DETACH DELETE n2;

We’re also checking the start of both words to make sure that they begin with the same characters, further cutting down the amount of data we’re going to check. A word of warning: we need to specify some sort of threshold for Sørensen–Dice similarity. In this example we’re trying to cut down how much data we bring in to query. There may be other situations where the temptation to overfit based on the sample data we’re using could cause unexpected outputs later. Do keep an eye out for that.

Last but not least, we’re now going to try and mop up the last few rogue items. There are misspellings/different spellings, such as herbs and herbes. We’re going to use APOC again, this time using Double Metaphone, a phonetic algorithm for indexing words based on how they sound.
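
As a quick sanity check (a sketch using the same procedure as in the query below), we can confirm that the two spellings encode to the same phonetic value:

// If herbs and herbes sound alike, the two codes should be identical
CALL apoc.text.doubleMetaphone(['herbs', 'herbes']) YIELD value
RETURN collect(value) AS phoneticCodes;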

Again, we’re using Sørensen–Dice similarity to cut down how much data we pull in:

MATCH (n1:IngredientName),(n2:IngredientName)
WHERE id(n1) < id(n2)
WITH n1, n2,
apoc.text.sorensenDiceSimilarity(n1.name,n2.name) AS sorensenDS
WHERE sorensenDS > 0.92
CALL apoc.text.doubleMetaphone([n1.name, n2.name]) YIELD value
WITH n1, n2, collect(value) AS val
WHERE val[0] = val[1]
WITH n1, n2
MATCH (n2)-[:IS_COMPONENT_OF]->(i:Ingredient)
MERGE (n1)-[:IS_COMPONENT_OF]->(i)
DETACH DELETE n2;

For those of you who are decomposing the queries to see the inner workings, you may well have noticed that words such as con and cone are being classed as the same. We have to decide whether this level of aggressive ‘de-duplication’ is acceptable or not, which will again come back to business rules. In this scenario, I’m taking the decision that it doesn’t matter: this is on tokenised words, we’re leaving the original ingredient names intact, and it may well be the case that con and cone belong to completely different ingredients. In that case the rate of intersection via IngredientName will be so low, we can forget about it.

First pass de-duplication

So, let’s have a look at what this tokenising and clean-up work has done for us so far. Using our cleaned-up tokens, we can now match the following:

MATCH (i:Ingredient)
WITH i, [(i)<-[:IS_COMPONENT_OF]-(in:IngredientName) | in] AS components
MATCH (i)-[:IS_COMPONENT_OF*2]-(i2)
WHERE i.name < i2.name
WITH DISTINCT i, components, i2
WHERE size((i2)<-[:IS_COMPONENT_OF]-()) = size(components)
AND all(in IN components WHERE (in)-[:IS_COMPONENT_OF]->(i2))
RETURN i.name, collect(i2.name)

Which gives us something like this:

Not bad! That’s 358 duplicate ingredients we’ve found already (replace the last line with RETURN count(*), as shown below). That’s just by tokenising the ingredient names, doing some clean-up, and then finding intersection matches.
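
For reference, here’s the counting variant in full:

MATCH (i:Ingredient)
WITH i, [(i)<-[:IS_COMPONENT_OF]-(in:IngredientName) | in] AS components
MATCH (i)-[:IS_COMPONENT_OF*2]-(i2)
WHERE i.name < i2.name
WITH DISTINCT i, components, i2
WHERE size((i2)<-[:IS_COMPONENT_OF]-()) = size(components)
AND all(in IN components WHERE (in)-[:IS_COMPONENT_OF]->(i2))
RETURN count(*) AS duplicates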

There’s one challenge we still need to overcome: we haven’t quite solved our previous issue where there’s an extra word, e.g. ‘gluten free white flour’ versus ‘gluten free flour’. Let’s have a look at how we might approach that next.

Introducing community detection for detecting duplicates

Mark introduced the Jaccard algorithm to find similarity between recipes. We’re going to use it again, this time to build relationships between ingredients, determining similarity based on their relationships to IngredientName. Those tokenised words are definitely very useful!

Once we’ve added more structure to the Ingredients, we’re then going to use Louvain to determine ingredient groupings. In this context, this will hopefully help join together all the ingredients that are the same, even with the odd missing word.

Louvain is a community detection algorithm. Based on the structure of the graph and how entities are connected, it will attempt to assign nodes to groups where they are deemed to have similar connectivity. In this example, Ingredients grouped together are assumed to be the same ingredient.

In true Blue Peter style, here’s one I prepared earlier. The node in the centre is the IngredientName node, and the nodes with numbers are the identified communities.

We won’t go over Jaccard again, since that was covered previously. What we do need to decide is what our cutoff value is going to be:

  • Too high and we’ll miss some true duplicates
  • Too low and we’ll treat non-duplicates as duplicates.

Again, we still need to be aware of the overfitting problem.

Here’s the Jaccard query:

MATCH (p:Ingredient)<-[:IS_COMPONENT_OF]-(n)
WITH {item:id(p), categories:collect(id(n))} AS itemsList
WITH collect(itemsList) AS itemsToProc
CALL algo.similarity.jaccard(itemsToProc, {
writeRelationshipType: 'SIMILAR_TO',
similarityCutoff: 0.8, write:true
})
YIELD nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean
RETURN nodes, similarityPairs, write, writeRelationshipType, writeProperty, min, max, mean;

A SIMILAR_TO relationship will be created between Ingredients that meet the cutoff value. We’re going to tell Louvain to work with the Ingredient nodes, and to use the SIMILAR_TO relationship type to determine the communities:

CALL algo.louvain.stream('Ingredient', 'SIMILAR_TO', {}) 
YIELD nodeId, community
WITH algo.getNodeById(nodeId) AS ingredient, community
MERGE (e:Entity {id:community})
MERGE (e)-[:CONTAINS_MEMBER]->(ingredient);

Here we yield a community ‘id’ and the node (via its internal id) that belongs to it. We in turn take that, create a new Entity node, and merge all the associated Ingredient nodes to it.
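
Before experimenting, it can be handy to eyeball the groupings we’ve just created. A sketch along these lines (using the Entity nodes and CONTAINS_MEMBER relationships created above) lists the larger communities:

// Show the biggest communities and their member ingredients
MATCH (e:Entity)-[:CONTAINS_MEMBER]->(i:Ingredient)
WITH e, collect(i.name) AS members
WHERE size(members) > 1
RETURN e.id AS community, members
ORDER BY size(members) DESC LIMIT 10;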

We can try experimenting with the Jaccard cutoff value. Trying a cutoff value of 0.7 gives us:

Jaccard Similarity with a cutoff of 0.7

whereas a cutoff value of 0.8 gives us:

Jaccard Similarity with a cutoff of 0.8

What should be clear here is that we’re approaching the borders of entity resolution. Depending on our definition of entity resolution, we might want to group all types of sugar into one category, or we may not. This will all come down to what is acceptable and what our business rules are.

There are a few erroneous results with the lower cutoff value, but they’re not unreasonable. As we’re still not looking at Ingredients with full context, there’s no way to know that the colour of the muscovado sugar might be important, or that explicitly stating that gluten-free flour is white is not necessary. And don’t forget about the overfitting challenge.

I’m going to leave it here for now — we’ve covered a lot of ground: some approaches we might use to tackle duplicates, and some hints at options around entity resolution. Don’t worry, we’re going to be back soon looking at some more ‘graphy’ ways to tackle that subject.

Looking back at where we’ve got to:

  • with ‘pure’ cleaning only, we’ve managed to bring down the number of ingredients from 3077 to 2719
  • using more fuzzy approaches, depending on the cutoff thresholds, we’ve further brought that number down to 2673

Summary

In this post we’ve shown some principles around how we might prepare and then look for potential duplicates in our data set. We’ve highlighted the importance of determining what the business rules are to enable us to implement them programmatically. Last but not least, we’ve also touched on the overfitting problem.

By tokenising ingredients and building out more structure, we have started to explore ‘graphy’ ways to find duplicates. It is also starting to come together as to how we might do more extensive entity resolution.

But our work here is still not finished. In the next post we’re going to have a look at how we might bring in some more context based on the data that we have, and see if we can do even better.
