Neo4j For Recommendation Use Case

24 min read · Aug 20, 2023

A beginner-friendly introduction to, and hands-on demo of, a Neo4j-based recommendation use case

Who is this gentleman smiling on the cover page? What is his role in data science, especially in recommendations? We’ll get our answers by the time you finish reading this publication!

What do customers signing in to E-commerce platforms, Video Portals, Podcast forums, etc. have in common? First, they buy or consume content; second, they all receive recommendations on what they can try next. How does the system know what to recommend?

Neo4j, a Graph Database, has many features that excel in servicing recommendation use cases. In this write-up, with live examples, we will see how Neo4j helps explore potential recommendation options for CeleBuy, our fictitious E-Commerce platform exclusively used by Hollywood celebrities!!

Part 1

In Part 1 of this multi-part write-up, we will focus on the Data model and creation of Graph Data set for our CeleBuy Portal. This will become the foundation for our explorations in the subsequent parts.

There are compelling algorithms available off the shelf in Neo4j. However, we can achieve many objectives even with basic Cypher queries. To start with, we’ll focus on those Cypher query patterns. By the way, Cypher is the language used to interact with Neo4j. By the time you finish reading this article, in addition to knowledge of recommendation systems, you’ll also get to know how to write basic Cypher queries as an added benefit!

Neo4j represents data using Labels, Nodes & Relationships. In-depth knowledge of Nodes, Relationships, Cypher or Data modelling in Neo4j is not a prerequisite. However, if at any point you would like a better grasp of these topics, my previous write-up will be helpful! It has numerous examples.

Neo4j For Beginners: Learn Graph Database, Neo4j, Cypher Query Language, Data Modelling in Neo4j and Index Free Adjacency with examples!! (medium.com)

There are two modes to read this write-up:

Newspaper mode: Read the content on any device, like how you read a newspaper, enjoy the business problems and learn how we solve them in Neo4j!

Explorer mode:

Log in to a Neo4j Aura instance. At the time of writing, one cloud instance is free; no credit card is needed to register an account.
While reading the content, you will come across my code snippets. Copy and paste the code, and you can visually experience the whole thing yourself, from data set creation to finding answers.
The code is indented to make it easier to understand the logical components and the sequence in which they lead to the Result.
Borrowing the Inductive Chain Learning approach, in these examples, when you solve one problem using Cypher, you will learn a technique that will help you solve the following problem.

CeleBuy, For the Shining Stars !!

Let’s start with the Data Model :

We have Customer Nodes and Brand Nodes.
The residence city of customers is stored inside the node property “address”
The type of the brand is stored inside “category”. For the demo, only luxury brands currently use this property.
Customer and Brand Nodes are connected over a “BOUGHT” Relationship.

Customer First

Let’s update all our shining stars and their address using Cypher.

If you find something inside the code starting with a double slash (//), that is not part of the code. It is a comment I have added for better clarity! Feel free to copy and paste the full block. Neo4j will automatically ignore content after // till the end of that line ;-)

// CREATE statement used to create nodes
// Use : followed by Label Name 
// Inside {}, provide Node Property Keys & Property Values

CREATE
    (:Customer{name:'B Pitt',address:'LA'}),
    (:Customer{name:'Ben Affleck',address:'LA'}),
    (:Customer{name:'G Clooney',address:'LA'}),
    (:Customer{name:'J Lopez',address:'LA'}),
    (:Customer{name:'Jimmy Kimmel',address:'LA'}),
    (:Customer{name:'K Smith',address:'LA'}),
    (:Customer{name:'M McConaughey',address:'LA'}),
    (:Customer{name:'Al Pacino',address:'NY'}),
    (:Customer{name:'J Roberts',address:'NY'}),
    (:Customer{name:'Matt Damon',address:'NY'}),
    (:Customer{name:'C Eastwood',address:'SF'}),
    (:Customer{name:'Ryan G',address:'SF'});

Let’s check what’s in our Neo4j Database now.

// MATCH (variable:Label) will find nodes with that label and save them in the variable
// RETURN is how Neo4j sends the information back, as a Graph or a Table

MATCH
    (n:Customer)
RETURN
    n

What we see above is a Graph View. Let’s get a Table View

// When we return a Result as X AS Y, Y is called an ALIAS
// All values of X will be returned under the column header Y.
// ORDER BY sorts the result before it is returned
// With ORDER BY column i , column j the data is first sorted by column i
// Rows that share the same value in column i are then sorted by column j

MATCH 
    (a:Customer)
RETURN
    a.name AS CustomerName , 
    a.address AS CustAddress
ORDER BY
    CustAddress , CustomerName

Brand Next

It’s time to upload the list of Brands available on the CeleBuy Platform

CREATE
    (:Brand{name:'Cartier EarRing',category:'luxury'}),
    (:Brand{name:'Chopard',category:'luxury'}),
    (:Brand{name:'Tiffany PearlN',category:'luxury'}),
    (:Brand{name:'Brixton Hat'}),
    (:Brand{name:'Dunkin D'}),
    (:Brand{name:'FiverTree'}),
    (:Brand{name:'M&G BeachTowel'}),
    (:Brand{name:'Martini'}),
    (:Brand{name:'NikeAir'}),
    (:Brand{name:'RayBan'}),
    (:Brand{name:'RemiTypeWrtr'}),
    (:Brand{name:'Sg Sunscr'});

Let’s get a Table view of all Brands loaded in our Database

MATCH
     (n:Brand)
RETURN 
    n.name AS BrName  , 
    n.category AS BrCateg 
    ORDER BY 
        BrCateg

Let’s take a Graph View of all nodes created so far

// The symbol | is called Pipe Symbol
// MATCH (n:X|Y) will match nodes belonging to either X or Y Label

MATCH
     (n:Customer|Brand)
RETURN
     n

Connecting Customers to Brands

I’ve captured the purchase pattern of our celebrities in CeleBuy, using some plain old traffic signals:

It’s time to build the Relationship inside our Graph using Cypher.

For this demo, let us set up a single Customer to Brand relationship, i.e. if the Customer buys the same Brand “n” number of times, we’ll not have “n” number of Relationship arrows flowing from the Customer to the Brand. We will have only one arrow between them.

This exercise is not to find the number of times a person buys. It is more about whether a customer buys, what brands they buy, and what Brand recommendations we can make from a data set.

MATCH 
    (c1:Customer{name:'Al Pacino'}),(c2:Customer{name:'B Pitt'}),(c3:Customer{name:'Ben Affleck'}),
    (c4:Customer{name:'C Eastwood'}),(c5:Customer{name:'G Clooney'}),(c6:Customer{name:'J Lopez'}),
    (c7:Customer{name:'J Roberts'}),(c8:Customer{name:'Jimmy Kimmel'}),(c9:Customer{name:'K Smith'}),
    (c10:Customer{name:'M McConaughey'}),(c11:Customer{name:'Matt Damon'}),(c12:Customer{name:'Ryan G'}),
    (b1:Brand{name:'Brixton Hat'}),(b2:Brand{name:'Cartier EarRing'}),(b3:Brand{name:'Chopard'}),
    (b4:Brand{name:'Dunkin D'}),(b5:Brand{name:'FiverTree'}),(b6:Brand{name:'M&G BeachTowel'}),
    (b7:Brand{name:'Martini'}),(b8:Brand{name:'NikeAir'}),(b9:Brand{name:'RayBan'}),
    (b10:Brand{name:'RemiTypeWrtr'}),(b11:Brand{name:'Sg Sunscr'}),(b12:Brand{name:'Tiffany PearlN'})
CREATE
    (c1)-[:BOUGHT]->(b9),(c3)-[:BOUGHT]->(b4),(c3)-[:BOUGHT]->(b8),(c3)-[:BOUGHT]->(b9),
    (c4)-[:BOUGHT]->(b11),(c5)-[:BOUGHT]->(b5),(c5)-[:BOUGHT]->(b7),(c6)-[:BOUGHT]->(b6),
    (c6)-[:BOUGHT]->(b9),(c6)-[:BOUGHT]->(b11),(c7)-[:BOUGHT]->(b3),(c8)-[:BOUGHT]->(b4),
    (c9)-[:BOUGHT]->(b1),(c9)-[:BOUGHT]->(b9),(c10)-[:BOUGHT]->(b4),(c11)-[:BOUGHT]->(b9),
    (c11)-[:BOUGHT]->(b11),(c12)-[:BOUGHT]->(b9),(c12)-[:BOUGHT]->(b11);
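
A small sketch of the one-arrow rule, assuming the CREATE statement above has already been run. Unlike CREATE, MERGE only creates a relationship when the pattern does not exist yet, so it is a handy way to record a repeat purchase without adding a second arrow. The pair used here (Al Pacino and RayBan) is simply one of the pairs created above, and the column name Arrows is my own choice for illustration.

```cypher
// MERGE creates the relationship only if it does not exist yet
MATCH
    (c:Customer{name:'Al Pacino'}),
    (b:Brand{name:'RayBan'})
MERGE
    (c)-[r:BOUGHT]->(b)
RETURN
    c.name AS CustomerName,
    b.name AS BrName,
    COUNT(r) AS Arrows
```

Because the arrow already exists, MERGE should simply find it, and re-running the block should not add another one.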

Let’s check the status of our Graph for nodes with a relationship, i.e. Green Traffic lights from our previous table!

MATCH
    (c)-[r:BOUGHT]->(b)
RETURN c,r,b

Let’s check the status of our graph, including the nodes that do not have a relationship, i.e. both Green & Red traffic lights!

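// OPTIONAL MATCH keeps every node, even when it has no BOUGHT relationship
// This avoids pairing every node with every purchase (a cartesian product)

MATCH 
    (n:Customer|Brand)
OPTIONAL MATCH
    (n)-[r:BOUGHT]->(b)
RETURN n,r,b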

We are now all set to explore the recommendation use cases.

Part 2

In Part 2, we’ll focus on the 1 Dimension Relationship approach, i.e. only the BOUGHT relationship between Customer & Brand, without considering the relationships that exist between the Customers themselves.

From now on, I’ll use the term “User Story” to describe the ask from the Brand, Marketing or Sales Team in simple language.

Let’s start with simple user stories to warm up and slowly and steadily increase the complexity in every subsequent attempt.

Reminder: If you find something inside the code starting with a double slash (//), that is not part of the code. It is a comment written for better clarity. Neo4j will automatically ignore content after // till the end of the line ;-)

User Story ID # 01 : Find who is buying a specific Brand

Reason: This is the most basic check by any Sales & Marketing team to get a sense of the type or demographics of customers for their brands

Let’s find out who is buying the Chopard Watches.

// Filter using the name property for the Brand

MATCH (c)-[r:BOUGHT]->(b)
WHERE
  b.name = 'Chopard'
RETURN c,r,b

Julia Roberts is endorsing the Brand and buying Chopard for her personal use. Good to Know ;-)

Key Takeaway:
We have used the Brand name as a filter. One can
- Use any single attribute to filter results.
- Use ‘AND’ to combine several criteria when all of them must be met.
- Use ‘OR’ to combine several criteria when meeting any one of them is enough.
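
Here is a minimal sketch of the AND / OR filters on our CeleBuy data. The brand names and the city come from the data set we created in Part 1; the particular combinations are only picked for illustration. If your tool does not accept two statements at once, run them one at a time.

```cypher
// AND: both conditions must hold
MATCH (c:Customer)-[r:BOUGHT]->(b:Brand)
WHERE
    b.name = 'RayBan'
    AND c.address = 'LA'
RETURN c,r,b;

// OR: either condition is enough
MATCH (c:Customer)-[r:BOUGHT]->(b:Brand)
WHERE
    b.name = 'Chopard'
    OR b.name = 'Martini'
RETURN c,r,b;
```

The first query narrows RayBan buyers down to those living in LA; the second widens the search to buyers of either brand.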

User Story ID # 02: Searching for Customers when the case and spelling in the Brand data are ambiguous.

Reason: Depending on the existing quality of data validation during order entry, it is possible to have information entered in many different ways. That should not act as a limitation to perform our search.

Let’s find out customers who bought Dunkin Donuts. There is a small problem, though! We are not sure if the Database has the word written as “Dunkin” or “dunkin” or “DuNkin” or whatever. The only thing we are confident of is that the first two letters of the Brand are ‘d’ followed by a ‘u’. Such a pity!

//convert name of all brands to full lower case
// check the converted text if it starts with 'du'

MATCH (c)-[r:BOUGHT]->(b) 
WHERE 
    toLower(b.name)
      STARTS WITH 'du'
RETURN c,r,b

Oh, that’s cool. So many celebrities have a sweet tooth. Even the Interstellar star. Alright! Alright! Alright!!

Key Takeaway:
There are several Text / String manipulation capabilities available in Neo4j. Some examples:
- Use ‘toLower()’ and ‘toUpper()’ to convert text to all lower case or all upper case, respectively, before further checks
- Use ‘STARTS WITH’ and ‘ENDS WITH’ to find text that starts with or ends with a given sequence, respectively
- Use ‘CONTAINS’ to find text that has a given sequence anywhere inside it
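
A quick sketch of CONTAINS, built the same way as the STARTS WITH query above. The fragment 'ban' is just an assumed search term for illustration, as if all we remembered of the brand was the middle of its name.

```cypher
// CONTAINS looks for the fragment anywhere in the lower-cased name
MATCH (c)-[r:BOUGHT]->(b)
WHERE
    toLower(b.name)
      CONTAINS 'ban'
RETURN c,r,b
```

Swapping CONTAINS for ENDS WITH would restrict the match to names that finish with the fragment.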

User Story ID # 03: Find Brands that customers are yet to buy

Reason: There could be many reasons why a product is non-moving. The price is high; the product is irrelevant to a particular market, is not well positioned or not presented to the right audience etc. Whatever the reason, knowing the list of non-moving products is crucial. CeleBuy can create custom campaigns to improve sales by recommending these brands. An unsold finished good, either in a factory or warehouse of a manufacturer or third party, is locking capital that can otherwise help in cash flow or be used to produce a fast-selling item.

// Result based on a missing pattern

MATCH (b: Brand) 
WHERE
 NOT 
  ()-[:BOUGHT]->(b)
RETURN b

That’s an interesting result.
- In the world of Artificial Intelligence, it is normal for people not to stock up on a Remington Typewriter. However, if the Brand team, on further inspection, realizes it is an 1878 Remington Model 2, that is a collector’s item, and the team can develop an appropriate campaign.
- Both Tiffany Pearl Necklace and the Cartier EarRings are current items. Maybe they need to be presented to someone with a taste for luxury. Hold on to that thought for now!

Key Takeaway:
- We used the ‘NOT’ operator to find nodes that do not have a certain pattern. That’s a pretty impressive feature.

User Story ID # 04: Find customers who are yet to buy any Brand

Reason: It’s important to incentivize registered customers to start buying. Otherwise, they might cancel their premium CeleBuy platform subscription.

MATCH (c:Customer) 
WHERE
 NOT
  (c:Customer)-[:BOUGHT]->()
RETURN c

Ok, Brad! Like the first rule of Fight Club, Let’s keep this a secret ;-) Jokes apart, the CeleBuy team knows who’ll need some persuasion to start purchasing on the platform.

Key Takeaway:
— The approach is similar to the previous negative pattern matching. The only difference is we are trying to find the Customer instead of brands.

User Story ID # 05: Display the most popular ’n’ number of Brands at platform level to all customers

Reason: Displaying the Top ’n’ Brands sold to all customers when they log in to an E-comm platform is one of the most common recommendation techniques. One can call it the Fear Of Missing Out (FOMO) feeling or Friending the Trend ;-) Whatever we call it, this is a basic requirement.

Let’s find the Three most popular items on our platform now.

// COUNT, as the name suggests, counts the target item
// AS defines an ALIAS that can be used as a Result column header or in a calculation
// ORDER BY sorts the result on a specific result column
// DESC sorts the result in descending order
// LIMIT n restricts the result to n entries

MATCH 
 (c)-[:BOUGHT]->(b)
RETURN
 b.name
  AS 
   Brand, 
 COUNT(b)
  AS
   SalesVolume 
 ORDER BY
  SalesVolume
   DESC
 LIMIT 3

If you are missing the familiar graph view and would like to see only the top 3 items by sale without any statistics like the volume, here is what we can do

// Using WITH to perform filter before returning the result

MATCH 
 ()-[:BOUGHT]->(b)
WITH 
 b, 
 COUNT(b) 
  AS 
   SalesVolume 
 ORDER BY 
  SalesVolume
   DESC 
 LIMIT 3
RETURN 
  b

User Story ID # 06: Display the most popular ’n’ number of Brands at the City Level

Reason: In the previous user story, we recommended to the user what’s trending at the platform level. Sometimes local tastes and purchasing behaviour differs from what’s popular at a platform level. How about recommending the Top 2 popular brands in their city?

// SubQuery is a technique to query within a query
// CALL{} is used to Subquery in Cypher
// DISTINCT will remove duplicates

MATCH (c:Customer)
WITH
    DISTINCT c.address AS Address
CALL {
    WITH
        Address
    MATCH
        (c)-[r:BOUGHT]->(b)
    WHERE
        c.address = Address
    RETURN
        DISTINCT b.name AS Brand,
        COUNT(r) AS SalesVolume
        ORDER BY
            SalesVolume DESC
        LIMIT 2
}
RETURN
    Address,
    Brand,
    SalesVolume

In this example, we have personalized the recommendation at the city level. One can extend the same logic to personalize based on any other Customer parameter, e.g. Country. Next time you see a “Most Popular in your region” style recommendation in an E-commerce, Podcast or Movie platform, you’ll understand how they could have approached this problem.

Key Takeaway:
— There is a larger Cypher Query that consolidates all results.
— A Sub-Query is running on a smaller dataset that meets specific criteria. In our case, customers belonging to the same Address.
— Output of Subquery is fed as input to the larger query for any post-processing.

User Story ID # 07: Finding the category of Brand the Customer has previously bought and finding other brands belonging to that category for recommendation.

Reason: It is critical to identify upselling and cross-selling opportunities. Observing the category of brands one buys makes it possible to recommend other items in the same category.

// First match to find all luxury brands bought by customer
// Second match to find all other luxury brands the customer has not bought yet

MATCH 
    (c:Customer)-[r:BOUGHT]->(b1:Brand{category:'luxury'})
MATCH
    (b2:Brand{category:'luxury'})
    WHERE
        b1.name <> b2.name
        AND NOT (c)-[:BOUGHT]->(b2)
RETURN
    c AS Customer,
    r AS Rltshp, 
    b1 AS AlreadyBuys,
    b2 AS CanBePitched

Since Julia Roberts bought the luxurious Chopard watch, our query recommends CeleBuy pitch other luxury items like the Pearl Necklace and the Ear Ring to her.

User Story ID # 08: For a given brand, finding other brand/s bought by the same Customer

Reason: Knowing what is generally bought together is a very powerful insight. Let’s find out what Brands were bought along with Dunkin Donuts.

// Assign known Brand to one variable and others to different variable
// Link them to a common customer node
MATCH
    (b1)<-[r1:BOUGHT]-(c)-[r2:BOUGHT]->(b2)
WHERE   
    b1.name = 'Dunkin D'
RETURN 
    b1,r1,c,r2,b2

In this particular case, Shoes & Sunglasses have little in common with Donuts. This will not be the case always.

User Story ID # 09: Finding Combi Pack / Bundling opportunities.

Reason: Have you ever seen instances of Toothpaste & Brush sold together? That is a simple explanation of a Combi Pack. Marketing teams want to create combi packs or bundles based on customer buying patterns.

This challenge extends the previous user story, with the scope expanding to the entire Graph for all brands.

MATCH 
    (b1)<-[r1:BOUGHT]-(c)-[r2:BOUGHT]->(b2)
WITH 
    (b1.name)+'...'+(b2.name) AS CombiPack
RETURN 
    CombiPack ,
    count(CombiPack) AS CombiCount
    ORDER BY 
        CombiCount DESC

There is a slight problem with this Result. When the system scanned the Graph, it assigned Martini as b1 and found FiverTree as b2. When it treated FiverTree as b1, it discovered Martini as b2.

If we recollect our high school math, the system found both Permutations where the order matters, while we are only interested in the unique Combinations.

After all, we are searching for a CombiPack and not a PermuPack ;-)

Let us now restrict the result to only unique combinations.

// Every Node has an internal ID
// Let us match instances only when node Id for b1 is greater than b2
// This will ensure we see only one combination.
// There might be other clever ways to arrive at the same outcome ;-)

MATCH 
    (b1)<-[r1:BOUGHT]-(c)-[r2:BOUGHT]->(b2)
WHERE 
    ID(b1)>ID(b2)
WITH 
    (b1.name)+'...'+(b2.name) AS CombiPack
RETURN 
    CombiPack ,
    count(CombiPack) AS CombiCount
    ORDER BY 
        CombiCount DESC
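
As one of those other ways, here is a sketch that compares the brand names instead of the internal IDs. It assumes brand names are unique, which holds for our CeleBuy data set. It also avoids ID(), which is deprecated in Neo4j 5 in favour of elementId().

```cypher
// Keep a pair only when b1 comes before b2 alphabetically
// Assumes brand names are unique in the data set
MATCH 
    (b1:Brand)<-[:BOUGHT]-(c:Customer)-[:BOUGHT]->(b2:Brand)
WHERE 
    b1.name < b2.name
WITH 
    (b1.name)+'...'+(b2.name) AS CombiPack
RETURN 
    CombiPack ,
    count(CombiPack) AS CombiCount
    ORDER BY 
        CombiCount DESC
```

A side effect of this choice is that the two names inside each CombiPack label always appear in alphabetical order, so the same pair is labelled the same way on every run.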

Voila! We now have only unique combinations. Let’s reflect on the Result.

A Customer bought NikeAir & Dunkin Donuts together, but bundling them does not make great business sense. Ben Affleck may have bought this in our CeleBuy due to his movie and ad engagements with these brands.

The same Customer bought Martini and Mixer. Now that makes sense. The marketing team could think of a “Say Cheers !” CombiPack with these two brands ;-)

There are many instances of customers buying SunScreen & RayBan Coolers. One could think of a “Cool Eye — Cool Skin” mini CombiPack.

Wait a minute. CeleBuy could develop a “Beat-The-Heat” CombiPack containing RayBan coolers, M&G Beach Towel, Sun Screen & Brixton Hat!!

One could even package this bundle under a much larger “Summer Beach” campaign, where altering the individual components can create many variants of Beat-The-Heat CombiPacks. Marketing Possibilities are endless!!

Part 3

In Part 3, we’ll focus on the 2 Dimensional relation approach, i.e. we’ll recommend based on who bought what and also on how people are related to each other.

CeleBuy has developed a new feature where a user can search for other users and capture details of how they are related.

Let’s check out what relations are recorded between customers.

Oh, that’s interesting. A summary of the relations:

Not all celebrities are tagged to other users
Of those who are tagged, most are linked by a BUDDY_OF relationship. Fair enough
We have one SPOUSE_OF relationship. Who doesn’t know the JenBen couple?
Wait a minute. Two users tagged each other on an ENEMY_OF relationship. Really? Oh, that’s Matt Damon and Jimmy Kimmel. We ran out of time to discuss the topic further ;-)

Let’s build the relationship links.

MATCH 
 (c3:Customer{name:'Ben Affleck'}),(c5:Customer{name:'G Clooney'}),
 (c6:Customer{name:'J Lopez'}),(c7:Customer{name:'J Roberts'}),
 (c8:Customer{name:'Jimmy Kimmel'}),(c9:Customer{name:'K Smith'}),
 (c11:Customer{name:'Matt Damon'})
CREATE
 (c3)-[:SPOUSE_OF]->(c6),(c3)-[:BUDDY_OF]->(c8),(c3)-[:BUDDY_OF]->(c9),
 (c3)-[:BUDDY_OF]->(c11),(c5)-[:BUDDY_OF]->(c7),(c5)-[:BUDDY_OF]->(c11),
 (c6)-[:SPOUSE_OF]->(c3),(c7)-[:BUDDY_OF]->(c5),(c8)-[:BUDDY_OF]->(c3),
 (c8)-[:ENEMY_OF]->(c11),(c9)-[:BUDDY_OF]->(c3),(c11)-[:BUDDY_OF]->(c3),
 (c11)-[:BUDDY_OF]->(c5),(c11)-[:ENEMY_OF]->(c8)

Let’s now check how the customers are linked in the portal.

// OPTIONAL MATCH keeps customers who are not linked to anyone
MATCH (c1:Customer)
OPTIONAL MATCH (c1)-[r]->(c2:Customer)
RETURN c1,r,c2

Let’s check the brand and customer nodes that are linked to other nodes. This sub-graph will become the basis for all further recommendations in this part.

// Every relationship in our graph starts from a Customer node,
// so one pattern covers both Customer-to-Customer and Customer-to-Brand links
MATCH (c:Customer)-[r]->(n)
RETURN c,r,n

User Story ID # 10: Recommend Brands based on what customer’s direct connects are buying

// Find all brands bought by direct connects
// Remove the brands customer has already bought
// Use Distinct to have unique list of brands to avoid duplicates
MATCH
    (c1:Customer{name:'Ben Affleck'})-[r]->(c2:Customer)-[r2:BOUGHT]->(b1:Brand)
WITH 
    DISTINCT b1,c1
WHERE 
    NOT (c1)-[:BOUGHT]->(b1)
RETURN 
    c1,b1

Let’s reflect a bit on the brand recommendation for Ben Affleck :

Matt Damon & J Lopez, two direct-connects, have bought the Sunscreen.
K Smith bought the Brixton Hat. Ok, That makes sense!
J Lo also bought a Beach Towel. No wonder why it is part of the list.
Both K Smith and J Lo also bought RayBan coolers, but it is not part of the recommendation. That’s because, in our code, we have asked the system to remove anything that the user has already bought.

User Story ID # 11: Recommend Brands based on what customer’s direct connects are buying, limited to a specific relationship type

What if we are interested in recommending only based on certain relationship types between the customers? Let’s run the previous result for Ben Affleck, limiting the graph only to the BUDDY_OF relation.

// Find all brands bought by direct connects limiting to BUDDY_OF relation
// Remove the brands customer has already bought
MATCH
    (c1:Customer{name:'Ben Affleck'})-[r:BUDDY_OF]->(c2:Customer)-[r2:BOUGHT]->(b1:Brand)
WITH 
    DISTINCT b1,c1
WHERE 
    NOT (c1)-[:BOUGHT]->(b1)
RETURN 
    c1,b1

A short comparison between the previous result and the new result:

Even after removing J Lo based connections, Sunscreen still appears because Matt Damon, a BUDDY_OF connect, also bought Sun Screen.
No change to that Brixton Hat bought by K Smith
That Beach Towel bought by J Lo does not appear in the recommendation list for Ben.

User Story ID # 12: Extending recommendation from direct connects to customers in the nth hop

Let us now increase the range of customers from friends in the first hop to the second hop, i.e. to include a friend of a friend. All one must do is specify the hop distance in the relation. It is that simple.

// Specify the starting and ending hop distance in relation
MATCH
    (c1:Customer{name:'Ben Affleck'})-[r*1..2]->(c2:Customer)-[r2:BOUGHT]->(b1:Brand)
WITH 
    DISTINCT b1,c1
WHERE 
    NOT (c1)-[:BOUGHT]->(b1)
RETURN 
    c1,b1

We have two new recommendations FiverTree and Martini, for Ben Affleck, coming from George Clooney, a friend of Matt Damon who is a direct connection to Ben Affleck.

The logic remains the same for any hop distance.

User Story ID # 13: Recommendation at nth hop

How about skipping immediate connections and recommending brands based on a customer who is two hops away? Let’s try that with Julia Roberts now.

Like all other features, it is a very small tweak to the code to get that result!

// Specify the precise hop distance after the *
MATCH
    (c1:Customer{name:'J Roberts'})-[r*2]->(c2:Customer)-[r2:BOUGHT]->(b1:Brand)
WITH 
    DISTINCT b1,c1,c2
WHERE 
    NOT (c1)-[:BOUGHT]->(b1)
RETURN 
    c1,b1,c2

The system skipped George Clooney, the direct connect and made recommendation based on the friend of friend Matt Damon, i.e. customer in the 2nd hop.

User Story ID # 14: Recommendation with no restriction on how far apart two customers are.

What we are looking for is to search for customers at every hop before making a recommendation. In simple words, to explore the entire network.

// Mention * to search the entire network
// An unbounded * can get expensive on large graphs; it is fine for our small demo
MATCH
    (c1:Customer{name:'Ben Affleck'})-[*]->(c2:Customer)-[r2:BOUGHT]->(b1:Brand)
WITH 
    DISTINCT b1,c1
WHERE 
    NOT (c1)-[:BOUGHT]->(b1)
RETURN 
    c1,b1

Bonus User Story ID # 15: Friend recommendation based on Triadic Closure

Let’s take a slight digression from our brand recommendation exercise. LinkedIn and Facebook recommend names of people we can connect with.

Triadic Closure is a concept in social networks which, in simple terms, means if person A is a friend of Person B and Person C, eventually, Person B and C will become friends. Treating them as points, they form a triangle, hence the word Triad.

// a and c share a common buddy b but are not buddies themselves
// ID(a) > ID(c) keeps each pair only once and also rules out a = c
MATCH
    (a)-[:BUDDY_OF]->(b)-[:BUDDY_OF]->(c)
WHERE
    NOT 
        (a)-[:BUDDY_OF]->(c) 
    AND 
        ID(a) > ID(c)
RETURN 
    a.name AS FirstPerson,
    c.name AS SecondPerson ,
    b.name AS CommonFriend 
        ORDER BY 
            CommonFriend

To cross-verify the result, please check the graph segment that deals with nodes connected to the BUDDY_OF relation.

The next time you see a connection recommendation on LinkedIn or Facebook, you can guess how they could have developed that feature. As one can see, it is not that complex for Neo4j or graph databases in general to handle such connected datasets.

Part 4

In Part 4, we’ll focus on the basics of Similarity algorithms and samples using Neo4j

It is the same photo again, but this time with his signature. Paul Jaccard was a professor at ETH Zurich between 1903 and 1938. He dealt with Botany and Plant Physiology. He was comparing different regions from the Alps.

His quest was to find how similar the floras were and to quantify that similarity between two floras in a score.

He had information about the type of species that grew in the Alps. Please note this is not actual data from the flora but a representation for understanding.

I tried to simplify his approach in the following snapshot :

For every flora pair, he got a score by dividing the count of what is common to both by a count of the unique list of all items in both flora.

Points to note :
- The denominator is not about all possible species under consideration for all ecosystems, i.e. it is not 10. It is only the count of items in both flora
- Again, in the denominator, we need a unique list and must not double-count what is present in both. Hence the ‘minus c’ in the a+b-c formula in our calculation
- The higher the score, the more similar the floras are. In our example, if we rank the pairs in terms of similarity, Flora 1 & Flora 3 are the most similar, and Flora 1 & Flora 2 are the least similar.

But wait a minute, why are we even discussing botanical research that is more than 120 years old while the topic of the day is recommendations?

It’s time to connect the dots. Look at the following, and it will dawn upon you where we are heading to!!

Using the same century-old logic, we can also determine customer similarity scores based on what they bought. Twelve CeleBuy customers result in 66 unique pairs of customers (12 × 11 / 2), and we can get a similarity score for each pair.

Many variations evolved after the initial Jaccard Index to quantify the extent of similarity between two nodes. For example, the Sørensen Index and the Overlap Index are two other ways.
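
For reference, here is how I would write the three indices down, using the a+b-c notation from above: a and b are the counts of items in each of the two sets, and c is the count of items common to both. The symbols J, S and O are my own shorthand, not standard labels from the article's snapshots.

```latex
\[
J(A,B) = \frac{|A \cap B|}{|A \cup B|} = \frac{c}{a + b - c},
\qquad
S(A,B) = \frac{2c}{a + b},
\qquad
O(A,B) = \frac{c}{\min(a, b)}
\]
```

A worked example from our own data set: Ryan G bought RayBan and Sg Sunscr (a = 2), while J Lopez bought M&G BeachTowel, RayBan and Sg Sunscr (b = 3), and two brands are common to both (c = 2). That gives J = 2 / (2 + 3 - 2) = 2/3 ≈ 0.67, S = 4 / 5 = 0.8 and O = 2 / 2 = 1.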

How about translating this knowledge into a Neo4j Query? After multiple attempts, I guess I got it right!

MATCH
    (c1:Customer)
MATCH
    (c2:Customer)
WHERE
    c1.name <> c2.name
    AND
    ID(c2) >ID(c1)
    AND
    (c1)-[:BOUGHT]->()
    AND
    (c2)-[:BOUGHT]->()
    AND
    COUNT {(c1)-[:BOUGHT]->(b)<-[:BOUGHT]-(c2)} >0
WITH c1,c2
CALL {
        WITH 
            c1,c2
        RETURN
            c1.name AS C1 , 
            c2.name AS C2,
            COUNT {(c1)-[:BOUGHT]->()} AS C1BoughtCount , 
            COUNT {(c2)-[:BOUGHT]->()} AS C2BoughtCount,
            COUNT {(c1)-[:BOUGHT]->(b)<-[:BOUGHT]-(c2)} AS BothBoughtCount
    }
RETURN
    C1 ,
    C2,
    C1BoughtCount, 
    C2BoughtCount,
    BothBoughtCount,
    BothBoughtCount * 1.0 / (C1BoughtCount + C2BoughtCount - BothBoughtCount)
        AS JaccardIndx,
    BothBoughtCount * 2.0 / (C1BoughtCount + C2BoughtCount)
        AS SørensenIndx,
    CASE
        WHEN C1BoughtCount  <= C2BoughtCount 
        THEN BothBoughtCount * 1.0 / (C1BoughtCount)
        ELSE BothBoughtCount * 1.0 / (C2BoughtCount)
        END
            AS OverlapIndx
        ORDER BY 
        JaccardIndx DESC

Ideally, we should have 66 pairs. If you count the result, you’ll find only 21 pairs. Guess why?

In the query, I’ve added a filter condition: when not a single brand is common to both customers, we eliminate that pair from the result. But why are we doing this?

All three indices have the count of common items bought(Intersection in Set Theory) in the numerator. When the numerator is Zero, all three indices will be zero. Those are customer pairs that are not similar at all.

The bigger question that should be running in our minds is the utility value of a similarity score. How does this help in recommendations?

With the similarity score, we can create a new SIMILAR_TO relationship between the customer nodes and include the score as an attribute.

Let me create the SIMILAR_TO relation between Ryan Gosling and other customers. Ideally, one should do this for all customers, but for the sake of simplicity, I am restricting it to only Ryan G.

Let’s create the relation in the Graph.

MATCH 
    (c3:Customer{name:'Ben Affleck'}),(c6:Customer{name:'J Lopez'}),
    (c9:Customer{name:'K Smith'}),
    (c11:Customer{name:'Matt Damon'}),(c12:Customer{name:'Ryan G'})
CREATE
    (c12)-[:SIMILAR_TO{jcrdScore:1.0}]->(c11),
    (c12)-[:SIMILAR_TO{jcrdScore:0.33}]->(c9),
    (c12)-[:SIMILAR_TO{jcrdScore:0.66}]->(c6),
    (c12)-[:SIMILAR_TO{jcrdScore:0.25}]->(c3)
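
For the curious, here is a rough sketch of how the same relation could be built for every customer pair in one go, reusing the Jaccard formula from the query above. This is my assumption of one possible approach, not something the rest of this write-up depends on. Note that it adds arrows for pairs the manual step left out and overwrites the rounded scores with unrounded ones, so skip it if you want your Graph to match the examples that follow exactly.

```cypher
// Sketch: build SIMILAR_TO for every customer pair that shares a brand
MATCH
    (c1:Customer)-[:BOUGHT]->(b:Brand)<-[:BOUGHT]-(c2:Customer)
WITH
    c1, c2,
    COUNT(DISTINCT b) AS BothBoughtCount
WITH
    c1, c2, BothBoughtCount,
    COUNT {(c1)-[:BOUGHT]->()} AS C1BoughtCount,
    COUNT {(c2)-[:BOUGHT]->()} AS C2BoughtCount
// MERGE reuses a SIMILAR_TO arrow that already exists instead of duplicating it
MERGE
    (c1)-[s:SIMILAR_TO]->(c2)
SET
    s.jcrdScore = BothBoughtCount * 1.0 / (C1BoughtCount + C2BoughtCount - BothBoughtCount)
```

Each pair is matched in both orders, so every customer gets an outgoing SIMILAR_TO arrow to each similar customer. That suits queries like the ones in this part, which always start from one customer and follow the arrow outwards.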

It’s time to check how the network of Brands and Customers looks through the lens of the new SIMILAR_TO relation.

MATCH 
    (b1)<-[r1:BOUGHT]-(c1)-[r2:SIMILAR_TO]->(c2)-[r3:BOUGHT]->(b2)
WHERE 
    c1.name = 'Ryan G'
RETURN
    b1,r1,c1,r2,c2,r3,b2

User Story ID # 16: Brand recommendation based on Similarity relation

Let us try a recommendation for Ryan G based on the SIMILAR_TO network.

MATCH 
    (b1)<-[r1:BOUGHT]-(c1)-[r2:SIMILAR_TO]->(c2)-[r3:BOUGHT]->(b2)
WHERE 
    NOT (c1)-[:BOUGHT]->(b2) AND
    c1.name = 'Ryan G'
RETURN
    c1,b2

User Story ID # 17: Brand recommendation based on Similarity Score

With a quantified Similarity score in place, it makes more sense to check for recommendations based on highly similar other clients. Let us find the network that qualifies for a Jaccard Similarity greater than 0.5

MATCH 
    (b1)<-[r1:BOUGHT]-(c1)-[r2:SIMILAR_TO]->(c2)-[r3:BOUGHT]->(b2)
WHERE 
    c1.name = 'Ryan G' AND r2.jcrdScore > 0.5
RETURN
    b1,r1,c1,r2,c2,r3,b2

What happened to Kevin Smith & Ben Affleck, who appeared in the previous user story? The answer lies in their similarity score with Ryan G.

Let’s get our new recommendations for Ryan G with this additional filtering criterion of other customers with a similarity score greater than 0.5

MATCH 
    (b1)<-[r1:BOUGHT]-(c1)-[r2:SIMILAR_TO]->(c2)-[r3:BOUGHT]->(b2)
WHERE 
    NOT (c1)-[:BOUGHT]->(b2) AND
    c1.name = 'Ryan G' AND
    r2.jcrdScore > 0.5
RETURN
    c1,b2

Not in his wildest dreams would Professor Jaccard have ever thought his research on ecosystems would one day lay the foundation for movie, music and product recommendations a century later!

Part 5

In Part 5, we’ll focus on the Neo4j Graph Data Science module and, of course, conclude ;-)

In the previous parts, we saw that Neo4j is not just another database but a solution that can answer complicated queries with a few lines of code.

Please don’t get deceived by its simplicity. If there is one thing I have learned in all my years of Tech Program and Product experience, it is complex and takes effort to design something simple for the user. It is even more complicated and takes a lot of organizational discipline to retain that simplicity over time ;-)

The extent of heavy lifting Neo4j is doing helps the main platform/software application to conserve its Infrastructure resource to focus on core functionality. If a database acts just as a store of information, the application must devote its resources and energy (literally!) to perform the complex logic.

If Neo4j Graph DB is an excellent platform, why is there a Neo4j Graph Data Science module? The answer becomes evident when scaling.

Our CeleBuy portal had 12 customers and 12 Brands. The queries we have written for this write-up will be exactly the same if there are 12 million customers and 12 million Brands. However, one can think of the computing power required to scan the entire Graph to return the results. The Data Science module has a concept called Projection to handle compute-intensive calculations.

In conventional Neo4j, data is stored on disk. Projection temporarily copies the full graph or a sub-section of the Graph from disk into Memory (RAM). Performing a process in Memory is typically much faster than performing the same calculation against the disk. Ever heard of the word ‘In-Memory’? Projection is all about that.

Second, look at the Cypher code I wrote to determine different Similarity scores. Is it performance efficient? Will it survive when the dataset gets exponentially bigger? Is it taking maximum advantage of the underlying system hardware? Absolutely not! I used standard Neo4j DB to explain similarity as a concept, and for that purpose this code is more than sufficient. However, many algorithms are available inside the Data Science module for industry needs. Project the Graph, follow the recommended syntax, and finding answers is a simple process.

Third, in our demo, knowing how similar customers are did not by itself deliver the desired outcome. We only hit the bull’s eye when we used that similarity score and built a SIMILAR_TO relation. I created that relation manually, and that too only for one customer, Ryan G. Is that scalable when there are millions of customers? That is where the Data Science module gives the flexibility to do two things. First, one can use the standard algorithms to create new relationships in the Projection based on specific rules and return just the results. As mentioned earlier, a Projection is temporary since it lives in Memory. Second, there is a provision to write the results back to the original dataset, precisely like how we added SIMILAR_TO to the original database.
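
To make this a little more concrete, here is a rough sketch of what project, run and write back could look like. It assumes the Graph Data Science library (2.x syntax, as I understand it) is installed on your instance, which may not be the case on every free cloud instance. The projection name 'purchases' is a hypothetical name picked for this sketch, and I reuse SIMILAR_TO and jcrdScore only to mirror what we created manually in Part 4.

```cypher
// Assumes the Graph Data Science (GDS) library is installed on the instance
// 'purchases' is a hypothetical projection name chosen for this sketch

// 1. Project Customers, Brands and BOUGHT relationships into memory
CALL gds.graph.project(
    'purchases',
    ['Customer','Brand'],
    'BOUGHT'
);

// 2. Run Node Similarity (Jaccard by default) and write the result back
CALL gds.nodeSimilarity.write(
    'purchases',
    {
        writeRelationshipType: 'SIMILAR_TO',
        writeProperty: 'jcrdScore'
    }
)
YIELD nodesCompared, relationshipsWritten
RETURN nodesCompared, relationshipsWritten;

// 3. Drop the projection to free the memory
CALL gds.graph.drop('purchases');
```

If your tool does not accept several statements at once, run the three steps one at a time. On a Graph that already contains the manually created SIMILAR_TO arrows, the write step would add further arrows next to them, so a scratch database is the safer place to try it.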

Fourth, the module comes with functionalities to train and test datasets for Machine Learning. It has all the bells and whistles.

If you have read this far, I’m honoured that you spent your time reading this multi-part write-up.

Sharing this write-up with someone who could benefit from learning about Recommendation systems or Neo4j would be greatly appreciated!!

To know more about the writer, check out the following link