Introduction To Graph Databases

Mar 16, 2023

Graph databases are a great tool in the world of big data. They provide a robust and easily understandable way to store, query and analyse complex relationships between data points. Unlike traditional databases, graph databases store information as a series of connected nodes, making identifying patterns, trends and dependencies easier. This makes them particularly useful when dealing with highly interconnected data sets.

This article describes the idea behind graph databases, introduces the main query languages and database options, and finishes with a practical example.

Graph databases are fast and efficient when the questions you ask are about relationships. They let you traverse the connections between data points quickly and run queries that would take much longer in a traditional database. This makes them ideal for applications dealing with large amounts of connected data, such as social networks, fraud detection and recommendation systems.

Graph databases are based on graph theory, which is the study of graphs and their properties. Graphs are mathematical structures that can be used to model relationships between objects. These relationships can be represented by nodes (vertices) and edges. Nodes are entities or documents, while edges form the connections or relationships between nodes.

For example, a node could be of type Person and contain data about a person's name, address and age. Unlike in a relational database, where we might store foreign keys in a table row, we don’t need to do that in a graph database. We only have to store the data. Types in graph databases are often called labels. In this case, the label is “Person”.

Edges define the relationships between nodes. Edges can have a type and can contain data. An edge between two Person nodes can, for example, have the type “Knows” with a data field “since”. Combining two Person nodes and an edge of type Knows gives us a graph describing these people and their relationship. For a set of two other people, we could have an edge of type “Manages”, describing a type of work relationship. Edges can be directed or undirected, depending on the relationship. A person managing another person is a directed edge, while a friendship between two people is undirected, since friendship is mutual.
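To make the vocabulary concrete, here is a minimal in-memory sketch in Python. The `Node` and `Edge` classes and the two sample people are assumptions made up for illustration; they are not the storage format of any particular graph database.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                                   # the node type, e.g. "Person"
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    type: str                                    # the relationship type, e.g. "Knows"
    source: Node
    target: Node
    directed: bool = True
    properties: dict = field(default_factory=dict)

    def connects(self, a: Node, b: Node) -> bool:
        """Return True if this edge can be followed from a to b."""
        if self.source is a and self.target is b:
            return True
        # An undirected edge can be followed in both directions.
        return not self.directed and self.source is b and self.target is a


# Hypothetical sample data.
alice = Node("Person", {"name": "Alice", "age": 34})
bob = Node("Person", {"name": "Bob", "age": 41})

knows = Edge("Knows", alice, bob, directed=False, properties={"since": 2015})
manages = Edge("Manages", alice, bob, directed=True)
```

The undirected Knows edge can be followed from either person, while the directed Manages edge only leads from Alice to Bob.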

You don’t need a dedicated graph database to use graphs. They are a conceptual construct for defining relationships and querying data, and they have been implemented on top of SQL and document databases such as Postgres and MongoDB.

Graph theory is also concerned with operations on the data, specifically querying: how can we find the shortest path between two nodes, or count the number of edges of a specific type?
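Both questions can be answered with very little code on a small graph. The sketch below uses a breadth-first search for the shortest path and a simple count for the edge type. The edge list and the function names are hypothetical, and edges are treated as undirected when searching.

```python
from collections import deque

# Hypothetical edge list: (source, edge type, target).
EDGES = [
    ("Ann", "Knows", "Ben"),
    ("Ben", "Knows", "Cat"),
    ("Cat", "Knows", "Dan"),
    ("Ann", "Manages", "Dan"),
    ("Dan", "Knows", "Eve"),
]


def neighbours(node):
    """Nodes reachable from node by following one edge in either direction."""
    result = []
    for source, _edge_type, target in EDGES:
        if source == node:
            result.append(target)
        elif target == node:
            result.append(source)
    return result


def shortest_path(start, goal):
    """Breadth-first search; returns the nodes on a shortest path, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in neighbours(path[-1]):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [nxt])
    return None


def count_edges(edge_type):
    """Count the edges that have the given type."""
    return sum(1 for _source, etype, _target in EDGES if etype == edge_type)
```

Calling `shortest_path("Ann", "Eve")` finds the route through Dan, because the Manages edge gives Ann a shortcut past Ben and Cat.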

What sort of data can you store in a graph?

Graphs are great for storing objects with complex relationships. Examples are relationships between people, where the number of potential relationship types would make a SQL or NoSQL model unwieldy and challenging to maintain and query. In SQL, it is relatively easy to write JOIN queries. But when the joins become too deep, because we need to find a friend of a friend of a relative of a person in our database, SQL falls apart and a graph starts to shine.
Graphs do so well in this case because each node holds direct references to its edges and neighbouring nodes, while SQL uses indexes and foreign keys to connect rows together.
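The friend-of-a-friend-of-a-relative question can be sketched as a chain of direct lookups. The adjacency map, the names and the `follow` helper below are invented for illustration and only hint at how a real graph engine stores its data.

```python
# Hypothetical adjacency map: node -> edge type -> directly connected nodes.
GRAPH = {
    "Sam": {"RELATIVE_OF": ["Kim"]},
    "Kim": {"FRIEND_OF": ["Lee", "Max"]},
    "Lee": {"FRIEND_OF": ["Noa"]},
    "Max": {"FRIEND_OF": ["Noa", "Pat"]},
}


def follow(start, edge_types):
    """Follow a chain of edge types from start and return the nodes reached."""
    frontier = {start}
    for edge_type in edge_types:
        # Each hop is a direct lookup on the current nodes, not a table-wide join.
        frontier = {
            neighbour
            for node in frontier
            for neighbour in GRAPH.get(node, {}).get(edge_type, [])
        }
    return frontier
```

Here `follow("Sam", ["RELATIVE_OF", "FRIEND_OF", "FRIEND_OF"])` returns the friends of the friends of Sam's relative. Adding another level of depth means adding one more entry to the list, where SQL would need one more join.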

Knowledge bases and data mining form an exciting use case for graph databases. Knowledge graphs are perfect for creating structure in unstructured data and finding clues and links while data mining. By connecting data and adding semantics, we can add context and meaning, increasing the value of data. We can discover clusters and relationships that weren’t obvious at first sight when using ML techniques for data mining.

Forensic data for crimes or information security can be stored and analysed using graph databases, another form of knowledge graphs. I am sure we can remember the CSI whiteboards, using pieces of string to connect pieces of information and pictures. Why not replace that with an easy-to-use database?

The most well-known use of graph data is in mapping and route finding. Mapping solutions like Google Maps use graph theory and graph databases to model our physical world and find optimal routes from A to B.

There are various pros and cons of using a graph database vs a relational database. Some of the pros of using a graph database include the following:

Graph databases can be used to store more complex data than relational databases. Relationships can be more varied and have more depth. In a relational database, every extra level of depth means another join.
Graph databases can be used to query data in more complex ways than relational databases. It is possible, for example, to find all the roads between two cities or to find the number of edges between two people who are not directly connected. In a graph, objects have a reference to connected edges, while relational databases use expensive joins to find elements by index.
Graph databases can be used to find relationships between data that would be difficult to find with a relational database. Using machine learning algorithms, finding clusters of similar data in a knowledge base is possible by analysing overlapping connections. Neo4j has 65 ML algorithms targeted towards data science to analyse graph data.

Some of the less favourable aspects of using a graph database include the following:

Graph databases can be more challenging to design and implement than relational databases. Referential integrity, for example, is handled differently. What do we do when we remove an entity? Should we remove any orphaned entities as well?
Graph databases can be more difficult to query than relational databases; the endless possibilities to design queries can send us down the wrong path. Simple queries are relatively easy, but complexity can increase quickly.
Graph databases can be more challenging to scale than relational databases.
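The referential integrity question from the first point can be made concrete with a small sketch. The node names and the `delete_node` helper are hypothetical. The `detach` flag mirrors the choice Cypher offers between DELETE, which refuses to remove a node that still has relationships, and DETACH DELETE, which removes the relationships along with the node.

```python
# Hypothetical graph: a set of node names and a list of (source, type, target) edges.
nodes = {"Keeper", "Lion", "Savannah"}
edges = [
    ("Keeper", "CARES_FOR", "Lion"),
    ("Lion", "LIVES_IN", "Savannah"),
]


def delete_node(name, detach=False):
    """Remove a node; refuse while edges are attached unless detach is True."""
    attached = [edge for edge in edges if name in (edge[0], edge[2])]
    if attached and not detach:
        raise ValueError(f"{name} still has {len(attached)} relationship(s)")
    # Remove the attached edges in place so no edge is left dangling.
    edges[:] = [edge for edge in edges if edge not in attached]
    nodes.discard(name)
```

Whether the nodes left without any relationships, such as the keeper in this example, should be removed as well is a design decision the sketch deliberately leaves open.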

It is up to the reader to decide if a graph database fits their use case. If you think it does, some options are listed below.

Query languages

Two different languages have been developed that form a de-facto standard for querying graph databases. Some graph databases even support more than one query language. Gremlin and Cypher are the languages to choose from if you want to get started.

Gremlin

Gremlin is a functional graph traversal language that is part of Apache TinkerPop. It is supported by most of the major graph databases, so it is a good candidate to learn.

Looking at this example we see that it is rather intuitive at first sight.

g.V().hasLabel('person').out('knows').values('name')

This query selects the vertices with the label (type) person, follows their outgoing edges of type knows to the neighbouring vertices, and returns the names of the people found there. The language uses chained functions as its syntax. It is pretty simple to use, but it is also possible to create rather complicated queries. Take the following, for example, where we select people with two levels of separation.

g.V().hasLabel('person').repeat(out('knows').simplePath()).until(loops().is(2)).path().by('name')
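If the chained steps are hard to read, it can help to see roughly what the traversal does in plain Python. This is only an approximation of the semantics, not how a Gremlin engine works internally, and the `KNOWS` data and the `paths_of_length` function are made up for this sketch.

```python
# Hypothetical directed "knows" edges between people, keyed by name.
KNOWS = {
    "marko": ["vadas", "josh"],
    "josh": ["peter", "marko"],
    "vadas": [],
    "peter": [],
}


def paths_of_length(hops):
    """Return every simple path that follows exactly `hops` outgoing edges."""
    paths = [[person] for person in KNOWS]        # g.V().hasLabel('person')
    for _ in range(hops):                         # repeat(...).until(loops().is(hops))
        paths = [
            path + [friend]
            for path in paths
            for friend in KNOWS[path[-1]]         # out('knows')
            if friend not in path                 # simplePath(): no vertex visited twice
        ]
    return paths                                  # path().by('name')
```

With this data, `paths_of_length(2)` keeps the path from marko via josh to peter but drops the path that would loop from marko via josh back to marko, which is the job of simplePath() in the query.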

You can try out Gremlin at https://gremlify.com/

For a complete guide using Gremlin, have a look at https://kelvinlawrence.net/book/Gremlin-Graph-Guide.html

openCypher

Cypher is a declarative language first developed and used by Neo4j. I think the ASCII-art-like query structure is quite cool. openCypher, the open specification of the language, is supported by a variety of databases.

MATCH (p:person)-[:knows]->(q:person) RETURN q.name

I think Cypher is a bit more expressive, and I prefer it over Gremlin. It is easy to start with, but it can become very complex for some queries.

MATCH p=shortestPath( (a:User {name: "Mary", surname: "Smith"})-[*]-(b:User {name: "Jane", surname: "Jones"}) ) RETURN p

Graph Database Options

Some popular graph databases are Neo4j, OrientDB, and ArangoDB, but some traditional databases also support graphs. These include PostgreSQL, MySQL, Microsoft SQL Server and even Redis. All major cloud providers have a managed database option, such as Amazon Neptune or Azure Cosmos DB.

Neo4j and OrientDB have nice graphical interfaces to explore and visualise data. All the dedicated graph databases are schemaless like a NoSQL database, which makes them easy to use and versatile.

Practical Example

Let’s create an example graph database using Neo4j.

It is easy to get started using Docker, or you could start a Sandbox at neo4j.com or sign up for the free cloud-based AuraDB version of Neo4j.

Start the Docker container with the command below.

docker run \
--name testneo4j \
-p7474:7474 -p7687:7687 \
-d \
-v $HOME/neo4j/data:/data \
-v $HOME/neo4j/logs:/logs \
-v $HOME/neo4j/import:/var/lib/neo4j/import \
-v $HOME/neo4j/plugins:/plugins \
--env NEO4J_AUTH=neo4j/test \
neo4j:latest

For the local dashboard, you can use http://localhost:7474/browser/ to log in with user neo4j and password test.

Alternatively, log in to the Docker container and start the Cypher shell so we can begin executing queries.

docker exec -it testneo4j bash
cypher-shell -u neo4j -p test

For a more visually pleasing interface, connect to Neo4j with a browser at http://127.0.0.1:7474/browser/. The visual graph representations and data exploration we get there are helpful.

![Neo4j interface](./images/neo4j-interface.png)

The Zoo Example

We will create a zoo database, which will hold all the information for the efficient running of the animal park.

The database consists of three types of nodes: Employees, Animals, and Locations. There are also relationships between the nodes, such as “works at” or “lives in”.

Employees: Each employee node has a name, job title, and date of hire.

Animals: Each animal node has a name, type, and date of birth.

Locations: Each location node has a name, type (e.g. Exhibit or Cage), and size.

Relationships:

Employees work at Locations
Animals live in Locations
Employees care for Animals

Execute the query below to fill the database with data. We have to execute this as one query because the variables (e1, a1, l1 and so on) that we use to create the relationships only exist within the query that defines them. Notice that we don’t have to create or define a schema.

CREATE (e1:Employee {name: "John Smith", job: "Zookeeper", hire_date: "1/1/2020"})
CREATE (e2:Employee {name: "Jane Doe", job: "Veterinarian", hire_date: "2/1/2020"})
CREATE (a1:Animal {name: "Lion", type: "Mammal", birth_date: "3/1/2020"})
CREATE (a2:Animal {name: "Tiger", type: "Mammal", birth_date: "4/1/2020"})
CREATE (l1:Location {name: "Lion Exhibit", type: "Exhibit", size: "Large"})
CREATE (l2:Location {name: "Tiger Cage", type: "Cage", size: "Small"})
CREATE (e1)-[r1:WORKS_AT]->(l1)
CREATE (e2)-[r2:WORKS_AT]->(l2)
CREATE (a1)-[r3:LIVES_IN]->(l1)
CREATE (a2)-[r4:LIVES_IN]->(l2)
CREATE (e1)-[r5:CARES_FOR]->(a1)
CREATE (e2)-[r6:CARES_FOR]->(a2);

We can run some basic queries now that we have the initial data. Let’s get the names of the employees and the animals they care for:

MATCH (e:Employee)-[r:CARES_FOR]->(a:Animal) RETURN e.name, a.name;

This will return a table of employees and the animal(s) they care for. Instead of just single properties, we can also select the whole object.

MATCH (e:Employee)-[r:CARES_FOR]->(a:Animal) RETURN e, a;

When we run this query in the browser, we can switch to a graphical representation of our data.

We can also find out which animals are living in which locations by querying the following:

MATCH (a:Animal)-[r:LIVES_IN]->(l:Location) RETURN a.name, l.name;
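To get a feeling for what these MATCH patterns do, here is a rough Python equivalent that works on the same zoo data. It is a simplified illustration with a hypothetical `match` helper and only the name property of each node, not a description of how Neo4j executes a query.

```python
# The zoo data from the CREATE query, reduced to (label, name) per node.
NODES = {
    "e1": ("Employee", "John Smith"),
    "e2": ("Employee", "Jane Doe"),
    "a1": ("Animal", "Lion"),
    "a2": ("Animal", "Tiger"),
    "l1": ("Location", "Lion Exhibit"),
    "l2": ("Location", "Tiger Cage"),
}
RELATIONSHIPS = [
    ("e1", "WORKS_AT", "l1"),
    ("e2", "WORKS_AT", "l2"),
    ("a1", "LIVES_IN", "l1"),
    ("a2", "LIVES_IN", "l2"),
    ("e1", "CARES_FOR", "a1"),
    ("e2", "CARES_FOR", "a2"),
]


def match(source_label, rel_type, target_label):
    """Return (source name, target name) for every relationship fitting the pattern."""
    rows = []
    for source, current_type, target in RELATIONSHIPS:
        s_label, s_name = NODES[source]
        t_label, t_name = NODES[target]
        if (s_label, current_type, t_label) == (source_label, rel_type, target_label):
            rows.append((s_name, t_name))
    return rows
```

Here `match("Animal", "LIVES_IN", "Location")` plays the role of the last query, and `match("Employee", "CARES_FOR", "Animal")` the role of the first one.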

The zoo has invested in a tiger forest and will get rid of the cage. Let’s move the tiger.

MATCH (a:Animal {name: 'Tiger'})-[r:LIVES_IN]->(:Location {name: "Tiger Cage"})
DELETE r
CREATE (l:Location {name: "Tiger Forest", type: "Exhibit", size: "Large"})
CREATE (a)-[:LIVES_IN {since: datetime("2022-12-01T18:40:32.142+0100")}]->(l);

We can set information on the relationship itself, in this case the date and time of the move. We can add information not only to nodes but also to edges, letting us create powerful constructs.

Now that we have the database, practice adding more animals, keepers and locations. Another idea would be to add relationships for employees who manage employees. We can’t cover all options and queries in this article, so I encourage you to explore them yourself.