Tap into Hidden Connections — Translating Your Relational Data to Graph
*Last updated 4/11/2019
Graph databases are a relatively new space in the technology industry. However, that doesn’t stop many companies big and small from jumping in. There are many cases to be made for one path or another, but one thing can be demonstrated in using graph databases — the importance of and meaning hidden in relationships. Graph databases tap into how pieces of data connect to one another, showing even the most remote connections between seemingly random pieces of data.
Perhaps you have read or heard about graph databases. Maybe you have even learned a little bit about them. But where do you go from there? How do you start taking real-life data from one source and put it into a graph to realize the value hidden within?
Today, I’d like to walk you through one way to translate your data from a relational database into a Neo4j graph database. While there are plenty of other types of data sources (like NoSQL), this post will simply focus on going from relational to graph. Hopefully, in future posts, I can dive into moving other types of sources into graph and other ways to translate relational data to graph, as well.
Neo4j ETL Tool
The Neo4j ETL (Extract-Tranform-Load) tool was built to allow developers an easy way to load relational data structures into a graph. It includes a 3-step process that allows the user to specify a relational database via a JDBC setup, then edit the data model mapping that the tool creates for graph, and finally import all of the data itself to Neo4j.
The tool is easy enough to access and use. There are two ways to interact with the tool — via command line or via Neo4j Desktop application. For working with the command line, you can download the GitHub project and run each step at a command prompt.
Add the Neo4j ETL Application
For working with the user interface version in Neo4j Desktop, you need to add this key in Neo4j Desktop — https://r.neo4j.com/neo4j-etl-app. This is demonstrated in the screenshot below.
Once this is complete, you will have access to the ETL application. Now, we need to add it to your graph applications.
First, go to your
Projects menu within Neo4j Desktop and choose/create a project to work within. Then, click on the
Add Application under the first banner on the right menu pane. When the popup displays to show you the applications you can add, choose the ETL App by clicking the
Add button. The screenshots below step through this process visually.
Now we are ready to launch the application and start transforming data from relational to graph! To load the application, just click on the newly-added application in your project.
Select a Project to Work With
This will open a new window with the main screen of the application, as shown below. You will need to choose the project to work within. This will tell the application where the Neo4j database is that you want to load your data into. To the right, you can see the list of possible databases in that project that you can use. Since no connections show up on the left, we don’t have any relational databases defined in our tool, so we will click
ADD CONNECTION to set one up.
Setup RDBMS Connection
This screen is where we set up a relational database connection. The tool allows most types of relational databases with an JDBC Driver, including MySQL, PostgreSQL, Oracle, Cassandra, DB2, SQL Server, Derby, and others. While MySQL and PostgreSQL databases are built-in to the tool, all other databases can be easily set up by specifying a driver file. We will walk through each step below.
For this example, I will be using a PostgreSQL relational database with the Northwind dataset loaded. If you do not already have a sample database/dataset to work with, you can follow along with this post by downloading PostgreSQL and running the shell script pulled from a GitHub repository for the Northwind dataset for Postgres.
By choosing the database type from the
Type dropdown field first, some of the other fields will automatically populate for you and save a bit of hassle. For this example, you will choose the
postgresql option. Notice that the port and connection URL are already correct for that choice.
We can now fill in the rest of the details, as needed. You can put whatever you like in the
Connection Name field, but the
Database field must match the name you gave your relational database. If you are using another database outside of Postgres or MySQL, be sure to choose a driver file for the tool to use for connecting.
*Note: if you set up a username and password for your relational database, you must enter those in the appropriate fields. However, if you did not specify a username and/or password for your relational database, you can leave one or both of the fields blank.
Once all the details are filled in, click the
Test and Save Connection button at the bottom of the pane. You should see a blue message bar appear at the top to say your connection was saved successfully. A red bar with an error message will appear if one of the fields is not correct or it cannot find the relational database.
Retrieve RDBMS Metadata
Ok, now we have our relational database connection set up. Next, we need to map from relational to graph by choosing our relational database connection that we are loading from in the left list, and then selecting our graph database we are writing to in the list on the right. Then you can click the
Start Mapping button.
You will see a blue message bar at the top of your screen if the step was successful or a red one if the step failed. Since ours was successful, our
Start Mapping button inactivates, and the
Next button activates to proceed to the next step. If the step fails, you can click the
SEE LOGS button at the bottom to debug the process.
The Data in Mapping.json
This is where the
mapping.json file is created with an outline of the mapping between the databases. This is an important part of the process because you can make changes to this file if you decide you want to leave out particular tables or fields in the migration.
To look at the file, pull up your main Neo4j Desktop window, go to your project, then click
Manage on the Neo4j database where you will migrate the data (in this example, it is named ETL_db). Click the
Open Folder button in the header pane and go into the
There, you should see a
postgresql_northwind_mapping.json file (or some similarly-named file). You can open this and edit if there are any changes you want to make before going to the next step. For this example, I will leave the
mapping.json file without any changes.
Adapt the Mapping
Going back to the ETL application window, you can click the
Next button at the bottom right, which will take you to a data model screen. You can move the nodes and relationships in the diagram on the right side of the screen so that you can better view each of them.
Here, you can change names of nodes, relationships, and even property (field) names and data types. I do think we can choose more appropriate relationship types, changing them from generic nouns to more specific verbs.
For instance, I changed the relationship from Order node to Product from “
ORDER_DETAILS” to “
INCLUDES” because an order includes products (i.e. line items).
Once you are satisfied, you can click the
SAVE MAPPING button and then click the
NEXT button at the bottom right. This takes us to the final screen, which is where we choose our import method. There are two options, and both are explained below.
- Bulk Import — fast, bulk-import of large data sets, database must be offline to run
- Online Import — runs Cypher via BOLT connection, database can be online and running
The key thing to note on the import method is whether your Neo4j database is running/online or not.
Finally, you can click the
Import Data button and see the output display at the bottom of the screen from the behind-the-scenes commands that get automatically executed. When it completes, a blue message bar will appear at the top of the window to say that the import was successful (screenshot below).
Check Imported Data
To verify everything loaded and to start working with your data in Neo4j, we can look at the data model and run queries from Neo4j Browser or an application connected to your database.
For this example, we will go back to our main Neo4j Desktop window, go to
Manage on the ETL_db database, then click
Open Browser. Executing the
CALL db.schema() on the Browser command line will display the graph data model for you.
Notice that the model does not contain the
Customer Demographics node connecting to
Customer node. This is because the Postgres data set used did not have any data in the customer demographics csv file.
From here, you can run queries, update data, and investigate the data to gather value and insight from structuring your connected data as a graph. The screenshot below shows an initial query (under Favorites -> Sample Scripts -> Basic Queries -> Get some data) to retrieve and display some customers, shippers, orders, and products in the data set.
If you have any questions, do not hesitate to reach out on our neo4j-users Slack community under the
#neo4j-etl channel. As always, happy learning! :)