A Technical Odyssey from Relational Database to Graph Database

Published in

Just Eat Takeaway-tech

11 min readDec 12, 2023

Navigating complex relationships often requires innovative solutions in database design. This article dives into a scenario where we tackle the challenge of uncovering connections between users based on certain attributes.

Problem Statement

Imagine a scenario where we need to establish relationships between users, connecting them based on shared activities that they like and don’t like. The goal is to find related users given a starting point, all while preserving the ability to search for information using the user’s name.

For instance, let’s assume we have three users:

Alice: likes sleeping, hates cooking.
John: likes jogging, hates cooking.
Max: likes jogging, hates playing.

The following schema is a demonstration of the relations between users. In our case, the relations should be established based on particular attributes. ALICE has a direct relation with JOHN because they both hate cooking, and JOHN has a direct relation with MAX because they both like jogging. There is an indirect relation between ALICE and MAX. Finding indirect relations makes the problem even harder to solve.

[ALICE] ----- [JOHN] ----- [MAX]

First, we will consider using MySQL, a relational database, to solve the problem, applying the following approaches:

Adjacency List
Adjacency List with a tree

Secondly, we will create nodes and edges in Neo4j, a graph database, to solve the problem.

You can use the following docker-compose file to have both databases locally:

services:
  mysql-db:
    image: mysql:latest
    restart: unless-stopped
    environment:
      MYSQL_ROOT_PASSWORD: root
      MYSQL_USER: testuser
      MYSQL_PASSWORD: testpassword
      MYSQL_DATABASE: test-db
    ports:
      - "3306:3306"
    volumes:
      - ./test-service-stack/mysql/data:/var/lib/mysql/
  neo4j:
    image: neo4j:5
    restart: unless-stopped
    ports:
      - "7474:7474"
      - "7687:7687"
    volumes:
      - ./test-service-stack/neo4j/:/conf
      - ./test-service-stack/neo4j/data:/data
      - ./test-service-stack/neo4j/import:/import
      - ./test-service-stack/neo4j/logs:/logs
      - ./test-service-stack/neo4j/plugins:/plugins
    environment:
      - NEO4J_dbms_memory_pagecache_size=2G
      - NEO4J_dbms.memory.heap.initial_size=2G
      - NEO4J_dbms_memory_heap_max__size=2G
      - NEO4J_AUTH=neo4j/!Secret1
      - NEO4JLABS_PLUGINS=["apoc"]
      - NEO4J_dbms_security_procedures_whitelist=gds.*, apoc.*
      - NEO4J_dbms_security_procedures_unrestricted=gds.*, apoc.*

Adjacency List

In graph theory and computer science, an Adjacency List is a collection of unordered lists that represent a finite graph. In the context of our database design, this approach involves connecting entities when they are saved. Each node in the graph represents a user, and the edges between nodes signify relationships based on shared attributes.

We need two tables: one to store user information and another to store their interests.


CREATE TABLE `USER` (
    id INT PRIMARY KEY AUTO_INCREMENT,
    name VARCHAR(255) NOT NULL,
    parent_id INT
);

ALTER TABLE `USER`
ADD CONSTRAINT FK_user_parent_id
FOREIGN KEY (parent_id) REFERENCES `USER`(id);

ALTER TABLE `USER` ADD INDEX (name);

CREATE TABLE INTEREST (
     id INT PRIMARY KEY AUTO_INCREMENT,
     user_name VARCHAR(255) NOT NULL,
    `like` VARCHAR(255) NOT NULL,
    `hate` VARCHAR(255) NOT NULL
);

ALTER TABLE `INTEREST` ADD INDEX (user_name);

The USER table has a foreign key relationship between the ‘id’ and ‘parent_id’ columns. This relationship is necessary for querying related users. INTEREST table stores liked and hated the activities of users.

There is no foreign key between the INTEREST and USER table because we need to insert their interests before inserting users.

After executing it, we should have the following structure:

Creating Parent-Child Relationships in Database Insertions

We aim to establish a relationship during the user insertion process. As soon as we have a new user with their likes and dislikes defined, we can use the INSERT INTO SELECT query to find other users who share the same interests. This will help us save time while identifying potential matches for the new user.

INSERT INTO `USER`(name, parent_id)
SELECT :name,
       u.id
FROM user u
         LEFT JOIN INTEREST i1 on u.name = i.user_name
WHERE
        i.like = :like
        OR i.hate = :hate;

We cannot insert the first user using the given query because the SELECT query within the INSERT INTO statement is not returning any results due to the fact that there are no related users in the database. To solve this, let’s first insert the user along with their interest.

INSERT INTO INTEREST (user_name, `like`, hate) values ('ALICE', 'sleeping', 'cooking');
INSERT INTO USER (name, parent_id) VALUES ('ALICE', NULL);

The other users can be inserted using the INSERT INTO SELECT query after inserting their interests.

For example, assuming we receive the user JOHN, we can run the following queries to insert both its interest and the user:

INSERT INTO INTEREST (user_name, `like`, hate) values ('JOHN', 'jogging', 'cooking');
INSERT INTO `USER`(name, parent_id)
          SELECT 'JOHN',
                 u.id
          FROM user u
                   LEFT JOIN INTEREST i on u.name = i.user_name
          WHERE
                  i.like = 'jogging'
                  OR i.hate = 'cooking';

After inserting all users, all users will be connected as follows:

Getting related users using a recursive query

We have succeeded in inserting our users with their parent-child relations. Now, we should be able to retrieve them all by giving the name of a user. Different Relational Database Management Systems (RDMS) offer diverse functions capable of executing various operations. Fortunately, the recursive query is a tool in this arsenal. It’s worth noting that the structure of this query may vary across databases. In our specific scenario, we will leverage the recursive query functionality tailored for MySQL.

WITH RECURSIVE rectree AS (
  SELECT *
    FROM user
   WHERE name = :name
UNION
  SELECT u.*
    FROM user u
    JOIN rectree
      ON u.parent_id = rectree.id
UNION
  SELECT u.*
    FROM user u
    JOIN rectree
      ON u.id = rectree.parent_id
) SELECT name FROM rectree;

This query, starting from a specified user, recursively retrieves that user’s descendants and ancestors from the user table, forming a hierarchical tree structure. The result is a list of user names.

However, recursive queries could perform slowly if you have a large set of data due to their nature. How can we overcome this problem?

Tree Structure

In the initial method, while we could store new users with a parent ID, establishing connections for all these users was challenging, and required an inefficient approach with a recursive query. If we possess the parent ID during user insertion, we can assign them a distinctive ID. This way, rather than navigating users solely based on the parent ID, we can retrieve them using this unique identifier called tree ID.

Below, you’ll find the same tables as before, but this time, we’ve introduced the tree_id column to the USER table for added context.

CREATE TABLE `USER` (
    id INT PRIMARY KEY AUTO_INCREMENT,
    name VARCHAR(255) NOT NULL,
    parent_id INT,
    tree_id INT NOT NULL
);

ALTER TABLE `USER`
ADD CONSTRAINT FK_user_parent_id
FOREIGN KEY (parent_id) REFERENCES `USER`(id);

ALTER TABLE `USER` ADD INDEX (name);
ALTER TABLE `USER` ADD INDEX (tree_id);

CREATE TABLE INTEREST (
     id INT PRIMARY KEY AUTO_INCREMENT,
     user_name VARCHAR(255) NOT NULL,
    `like` VARCHAR(255) NOT NULL,
    `hate` VARCHAR(255) NOT NULL
);

ALTER TABLE `INTEREST` ADD INDEX (user_name);

To insert the first user into the database, we will follow the same procedure as before. However, this time, we will assign a unique identifier for the tree_id. All the users who are related to each other will share the same tree ID. But if there are no related users in the database, we will have to assign a new tree ID to the user.

INSERT INTO INTEREST (user_name, `like`, hate) values ('ALICE', 'sleeping', 'cooking');
INSERT INTO USER (name, parent_id, tree_id) VALUES ('ALICE', NULL, 123);

Also, let’s update our INSERT INTO SELECT query so that it can get the tree ID from the related user and insert it for the new one.

INSERT INTO `USER`(name, parent_id, tree_id)
          SELECT :name,
                 u.id,
                 u.tree_id
          FROM user u
                   LEFT JOIN INTEREST i on u.name = i.user_name
          WHERE
                  i.like = :like
                  OR i.hate = :hate;

Before inserting the users, please ensure to insert their attributes first, and then insert the user itself. As a result, you should have all three users sharing the same tree_id.

The recursive query is not needed anymore due to the existence of tree_id. Relates users can be retrieved by tree_id. Additionally, we may be able to get rid of the parent_id column, depending on our needs.

SELECT relatedUser.*
FROM user u
      JOIN user relatedUser on relatedUser.tree_id = u.tree_id
WHERE
         u.name = :name;

In case we need to consider another attribute in the future, we can add it as a new column in the INTEREST table. However, there is a more flexible approach we can take. We can change the structure of the table and replace the INTERESTS table with a more generic table called ATTRIBUTES. This table will allow us to store key-value pairs for any attribute we want to track, making it easier to adapt to future changes.

CREATE TABLE ATTRIBUTES (
     id INT PRIMARY KEY AUTO_INCREMENT,
     user_name VARCHAR(255) NOT NULL,
    `key` VARCHAR(255) NOT NULL,
    `value` VARCHAR(255) NOT NULL
);

In that case, we need to insert a couple of times, and the INSERT INTO SELECT query needs to be updated accordingly.

INSERT INTO ATTRIBUTES (user_name, `key`, value) values ('ALICE', 'like', 'sleeping');
INSERT INTO ATTRIBUTES (user_name, `key`, value) values ('ALICE', 'hate', 'cooking');

Drawbacks

We managed to connect users in a relational database based on particular attributes. However, we applied the best-case scenarios when we inserted users. This approach has some limitations.

SELECT query in INSERT INTO SELECT might get more than one user if it finds more relations, which can create ambiguity.
It does not allow us to merge two trees that were created in the past separately.
It is error-prone and can get complex in the future.
It is not a very extendable solution.

While every problem has a solution, it appears that we’re attempting to solve an issue that’s already been addressed by other database types.

Graph Database

Ironically, a relational database is not always the best choice for modeling the relations in your data. A graph database is a network of interconnected nodes and edges. Instead of tables, you have nodes that represent entities and edges that represent relationships between entities. Each node can have properties, much like fields in a table, and relationships are fundamental to the data model. This makes graph databases particularly effective in scenarios where relationships are as important, if not more important, than individual data points.

Neo4j, a graph database, can help overcome the problem. Three nodes are needed: USER, LIKE, and HATE.

For simplicity, let’s name the edges that define the relations between nodes as INTERESTS. The final view should look like this:

In graph databases, nodes and edges are created on-the-fly, without the need for predefined schemas. This is different from relational databases where tables and relationships are typically structured beforehand. In a graph database, nodes can have dynamic properties, and edges representing relationships can be established without predefined constraints. This flexibility is useful when dealing with evolving or unpredictable data structures, allowing for a more agile and adaptable approach compared to the structured nature of relational databases.

We can use Neo4j Browser to execute queries and view the contents of the database. If you run the docker-compose shared in this article, you should be able to access the Neo4j Browser at http://localhost:7474/browser/.

Inserting nodes and edges

Neo4j’s Cypher query language enables users to enable users to interact with and query graph data using a human-readable syntax. It is optimized for expressing patterns and operations in graph databases, making it easy to traverse and manipulate data represented as nodes, relationships, and properties in Neo4j.

A Cypher query can be constructed in a way that inserts all nodes and edges together.

MERGE (u:User { name: $name })
 ON CREATE SET u = { id: $id, name: $name }
 WITH u
 MERGE (h:Hate { name: $hate})
 ON CREATE SET h = { name: $hate }
 MERGE (u)-[:INTERESTS]->(h)
 WITH u
 MERGE (l:Like {name: $like})
 ON CREATE SET l = { name: $like }
 MERGE (u)-[:INTERESTS]->(l)

Retrieving connected nodes

After adding all nodes and their edges, we can use the APOC plugin to retrieve all connected users.

APOC (Awesome Procedures on Cypher) extends the functionality of the Cypher query languages with a powerful collection of additional operations, utilities, and features. It’s a valuable tool, as it offers procedures for data transformation, graph algorithms, and other advanced operations. For more information, you can check the documentation here.

MATCH (u:User  {name : $name})
      CALL apoc.path.subgraphNodes(u, {})
      YIELD node
      RETURN node

For retrieving all the users connected, we can use the query above. The subgraphNodes function can be used to filter or limit the query, but in our case, we don't need to apply any filter or limit. Therefore, we can pass an empty object to it.

It's worth noting that if you run a query in the Noe4j Browser, you'll be able to see all the connected nodes along with their edges. However, if you're using Spring Data Neo4j, you'll need to modify the query to provide more information. This additional information is necessary so that Spring can understand the relations and convert them into Java objects. In our case, using expandConfig should be sufficient.

MATCH (u:User {name : $name})
      CALL apoc.path.subgraphNodes(u, {})
      YIELD node
      WITH node
      CALL apoc.path.expandConfig(node, {} )
      YIELD path AS relatedUsers
      RETURN relatedUsers

If you have completed all of the necessary steps, you should now be able to see something that looks like this:

That’s it. We managed to solve our problem in the problem statement.

Conclusion

This technical exploration dived into the challenges of establishing relationships between users based on shared attributes, transitioning from a relational database (MySQL) to a graph database (Neo4j). The relational database approach involved using an Adjacency List model, considering parent-child relationships, and utilizing recursive queries for finding related users. However, this method has limitations in terms of scalability, complexity, and extensibility. Also, the first approach led us to use a recursive query that might reduce the performance.

The shift to a graph database, Neo4j, offered a more intuitive solution for handling relationships. The dynamic nature of graph databases allowed for the creation of nodes and edges on the fly, offering flexibility in representing relationships without predefined schemas. The Cypher query language, complemented by the APOC (Awesome Procedures on Cypher) library, facilitated the insertion and retrieval of connected nodes efficiently.

Ultimately, the graph database approach demonstrated a more natural and scalable solution for managing complex relationships, providing a valuable alternative to traditional relational database models. The flexibility and adaptability of graph databases, particularly evident in the Neo4j implementation, showcased the advantages of leveraging specialized tools when dealing with scenarios where relationships play a crucial role.

Just Eat Takeaway.com is hiring! Want to come to work with us? Apply today.