How to Connect Neo4j Graph Database to Data Sources?

Mouna Challa · Aarth Software

16 min read · May 4, 2023

Connecting Neo4j Graph database to data sources allows you to leverage the power of graph technology in combination with various data stores.

This guide will walk you through the process of connecting Neo4j to different types of data sources, enabling you to unlock valuable insights from your diverse data ecosystem.

  • Neo4j is a highly scalable, ACID-compliant graph database that uses the property graph model.
  • It allows you to represent complex relationships and perform efficient graph traversals, making it ideal for applications that heavily rely on interconnected data.
  • Neo4j offers a powerful query language called Cypher, designed specifically for working with graph data; a brief example follows.
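As a small, hypothetical illustration of Cypher, here is a query over a simple social graph (the Person label and KNOWS relationship are assumptions for the example, not part of any particular dataset):

// Find the people Alice knows in a hypothetical social graph
MATCH (alice:Person {name: 'Alice'})-[:KNOWS]->(friend:Person)
RETURN friend.name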

Benefits of Connecting Neo4j to Data Sources

  • Holistic Insights: By connecting Neo4j to data sources, you can integrate and analyze data from multiple systems, gaining a comprehensive view of your organization’s data landscape.
  • Relationship Analysis: Neo4j’s graph database model excels at capturing and analyzing relationships, enabling you to uncover hidden patterns and dependencies.
  • Real-time Decision Making: Connecting to data sources empowers you to query and analyze data in real time, making informed decisions based on up-to-date information.
  • Enhanced Performance: By combining Neo4j with various data sources, you can optimize data retrieval and processing, improving overall system performance.
  • Flexible Data Integration: Connecting Neo4j to data sources enables seamless data integration, allowing you to leverage existing data investments and systems.

Connect Neo4j to Relational Databases with Neo4j ETL Plugin

Neo4j has created an ETL (Extract-Transform-Load) tool to make it easier for developers to convert relational data structures into graph databases. The tool helps create a streamlined and efficient process for loading such data.

Neo4j provides a 3-step process to import data from a relational database into its graph.

It starts with setting up a JDBC connection and then editing the data model mapping that has been created. Finally, all the data is imported into Neo4j.

Obtaining and using the tool is straightforward. It can be used in two different ways — either through the command line or through the Neo4j Desktop application.

Working with the command line is made easy by the neo4j-etl project on GitHub. All you need to do is download it and run each step at a command prompt.

Benefits of connecting Neo4j to Relational Databases

  • Comprehensive Data Analysis: By connecting Neo4j to relational databases, you can perform comprehensive analysis by leveraging graph-based querying and visualization on relational data.
  • Relationship Mapping: Neo4j’s graph database model is ideal for representing and analyzing complex relationships, enabling deeper insights into the interconnectedness of data stored in relational databases.
  • Data Enrichment: Integration with relational databases allows for data enrichment, combining graph data with additional context and details from structured sources.
  • Performance Optimization: Neo4j’s optimized graph querying capabilities enhance performance when traversing relationships in relational data, enabling faster and more efficient analysis.
  • Data Synchronization: ETL plugins enable continuous or one-time synchronization between Neo4j and relational databases, ensuring that changes made in either system are reflected in the other.

Step 1: Prerequisites

If you haven't done so already, download Neo4j and follow the accompanying instructions to set up a project and database; the instructions are shown when you download the software.

To get the Neo4j ETL tool, make sure you have a project and database instance setup on Neo4j Desktop.

Then go to the Graph Applications tab and paste this link (https://install.graphapp.io/neo4j-etl-app) in the Install Box before clicking Install.

After you've finished these steps, the ETL application will appear alongside your other graph applications.

Step 2: Verify ETL Tool is Listed and Open

You should now have access to the application. Open the ETL Tool by clicking its icon in the Graph Applications list on the left side of the page.

The tool opens independently of any current project or database, and it does not require an active database.

Step 3: Choosing a Project

Upon starting the ETL Tool, you may be prompted to allow some background processes to run.

It is recommended to select ‘Allow’, which will finish loading the application and bring up its main screen.

On this page, you will see a default project and the graph database instances associated with it. If you have not created any such databases yet, this list will be empty.

Before advancing any further, you must create a graph database instance, either local or remote, for the ETL Tool to import into.

Once you have created the database, you have to decide which project to focus on.

This will determine which Neo4j databases will be available for you to import the data into at a later stage.

To the right, a list of available databases is present for the project.

The left side lists relational database connections. Since none are defined yet, click ‘ADD CONNECTION’ to set one up.

Step 4: Set Up a Connection to Relational Database

When creating a new connection, you will be guided to a form where you should enter the necessary details.

Using its JDBC drivers, the tool can connect to various types of relational databases, such as MySQL, PostgreSQL, Oracle, Cassandra, DB2, SQL Server, and Derby.

Drivers for MySQL and PostgreSQL come bundled with the tool. If you want to use other databases, all you need to do is specify a driver file.

For this example, select PostgreSQL as the option. The port and connection URL have already been filled in correctly for this choice.

Now, you must fill in the remaining information. You can give any name to your Connection Name field, but it is essential that the Database field matches the exact name of your relational database.

If you have supplied a username and/or password while setting up your relational database, you’ll need to enter them into their respective fields.

On the other hand, if you didn’t specify either of those credentials while creating the database, you can leave one or both of these empty.

When working with databases other than Postgres or MySQL, remember to select a suitable driver file for your tool to facilitate the connection.

After completing the details, click the Test and Save Connection button located at the bottom of the window.

If everything is successful, you will receive a confirmation message in blue at the top of your screen. If any of the fields are incorrectly filled or the relational database is unable to be found, an error message will appear as a red bar on your screen.

Step 5: Choose the ‘From’ and ‘To’ Databases for Import

After successfully establishing a connection with a relational database, we have to determine what will be loaded into the graph database.

This is done by selecting the source of the data (the relational database connection) in the left list, and then the destination (the graph database that will receive the data) in the right list. Then click the Start Mapping button.

After completing a step, a blue message bar will appear at the top of your screen to indicate successful completion.

Alternatively, if the step fails, you will be notified with a red message bar. In our case, since the step was completed successfully, the Start Mapping button will be inactive and the Next button can now be pressed to proceed.

In the event of a failed step, you can investigate further by clicking the SEE LOGS button located at the bottom to debug.

Step 6: Adapt Your Metadata

Upon clicking the “Next” button at the bottom right, you will be directed to the data model page. Here, you can alter any of the mappings, including property names, data types, and relationships, if need be.

On the left side of the page, you can find two tabs — one showing all the Nodes and the other one listing all Relationships.

To update any of them, simply click on it in the list. The tool lets you leave out nodes and connections that are not necessary to be imported into the graph.

To edit the details of mappings, you can click on the pencil icon next to an entity in the list or double-click on the said entity in the visualization.

This will open a popup box that contains all fields and give you options for making any changes. Clicking ‘Save’ will commit your modifications to the graph.

Step 7: Import the Data

Neo4j offers multiple approaches to move data into the graph database depending on its condition. Every technique has different prerequisites and advantages which are described in detail below.

  1. If the database is running (works for both local and remote instances):

  • Online Direct: imports over a Bolt connection, turning SQL results into Cypher parameters.
  • Online Batch: imports batches of CSV files produced in the mapping stage over a Bolt connection.

  2. If the database is shut down:

  • Bulk Import: a fast loader for the initial offline load (running the neo4j-admin import tool).

NOTE: It is essential to take into consideration whether your Neo4j database is active or not when importing data.

By clicking the ‘Import Data’ button, you can initiate automatic commands that will perform their task in the background.

Once it has been completed, a blue message bar will appear at the top of your window to notify you that the import was successful, and the output will be displayed at the bottom of your screen.

Step 8: Check Imported Data

To ensure a successful data loading process and start using the Neo4j database, closely examine the data model.

From there you can run queries directly from Neo4j Browser or an application connected to your database.

To view the graph data model, we will first access the Neo4j Desktop window, click on ‘Manage’ for the ETL db database and open the Browser.

After that, run ‘CALL db.schema()’ from the Browser command line to display it (in Neo4j 4.x and later, the equivalent is ‘CALL db.schema.visualization()’). Your relational data has now been transformed into a graph!
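Beyond viewing the schema, a couple of quick sanity checks in the Browser can confirm the import worked. This is a minimal sketch; the labels you see will depend on your source tables:

// Count imported nodes by label
MATCH (n) RETURN labels(n) AS label, count(*) AS count

// Spot-check a few nodes and their relationships
MATCH (n)-[r]->(m) RETURN n, r, m LIMIT 25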

In this example, the model does not have the Customer Demographics node connected to the Customer node because of the lack of customer demographics data in the Postgres dataset.

Connecting Neo4j to NoSQL Databases

Integrating Neo4j with NoSQL databases enables you to combine the power of graph-based modeling with the flexibility and scalability offered by NoSQL technologies.

This integration allows you to leverage the strengths of both Neo4j and the respective NoSQL database, enabling advanced analysis and efficient data management.

Benefits of connecting Neo4j to NoSQL Databases

  • Enhanced Data Relationships: By connecting Neo4j to NoSQL databases, you can capture and analyze complex relationships within your data, providing deeper insights and facilitating more accurate decision-making.
  • Scalability and Flexibility: NoSQL databases excel at handling large volumes of unstructured or semi-structured data, and integrating Neo4j with NoSQL databases allows you to scale your graph-based applications effectively.
  • High Performance: The combination of Neo4j’s optimized graph querying and NoSQL database’s high throughput capabilities allows for efficient and fast data retrieval and processing.
  • Data Enrichment: Connecting Neo4j to NoSQL databases enriches graph data by incorporating additional information from diverse data sources, enabling comprehensive analysis and exploration.
  • Real-time Data Processing: Integration with NoSQL databases enables real-time data ingestion and processing, empowering you to make timely decisions based on the most up-to-date information.

Connecting to MongoDB Database

The Neo4j Doc Manager project synchronizes data from MongoDB to Neo4j, taking advantage of the relationships in the data, which is what Neo4j does best.

This makes relationship-centric queries much easier than in MongoDB alone, which can be very useful in a variety of situations.

Neo4j Doc Manager enables you to work with MongoDB documents in a Neo4j graph format.

The documents are structured in the same manner as specified by Mongo Connector, thereby enabling easy querying for relationships.

This service allows you to sync your data from MongoDB to Neo4j in real-time so that you can benefit from the advantages of both databases in your application.

Step 1: Installation

To use the Neo4j Doc Manager, Python should be installed on your system.

The most recommended way to do so is by using the ‘pip’ package manager of Python (You might need sudo privileges).

pip install neo4j-doc-manager

Step 2: Using Neo4j Doc Manager

To be able to use Neo4j, you need to make sure that the instance is up and running.

Additionally, if authentication has been enabled (Neo4j 2.2 or higher), you need to set the environment variable ‘NEO4J_AUTH’ with your username and password.

export NEO4J_AUTH=user:password

To set up MongoDB as a replica set, start the mongod process with the following command. Remember to monitor the health of your replica set regularly to maintain optimal performance.

mongod --replSet myDevReplSet

Next, launch the Mongo shell and execute the command:

rs.initiate()

For additional information, kindly check the Frequently Asked Questions section of Mongo Connector.

To begin using mongo-connector, enter the command below in the terminal:

mongo-connector -m localhost:27017 -t http://localhost:7474/db/data -d neo4j_doc_manager

  • -m specifies the Mongo endpoint.
  • -t specifies the Neo4j endpoint (make sure to include the protocol, HTTP).
  • -d specifies the Neo4j Doc Manager.

Step 3: Data synchronization

Neo4j Doc Manager allows for documents stored in MongoDB to get converted into a graph structure and added to Neo4j.

It does this by transforming the keys of the documents into nodes, and then extracting the nested values associated with each key as properties.
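As a rough sketch of what this looks like on the Neo4j side, suppose documents are synced from a hypothetical MongoDB collection called talks. A query along these lines should show each root node and the nodes created from its nested keys; the exact label and relationship naming follows the Doc Manager's own conventions, so treat the names below as assumptions:

// Inspect nodes synced from the hypothetical 'talks' collection
// (Document/talks labels and relationship names are assumed for illustration)
MATCH (d:Document:talks)-[r]->(nested)
RETURN d, type(r), nested
LIMIT 10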

Connecting to Cassandra Data Source

Neo4j is a great option for working with graph-based data since it is specifically designed for that purpose.

On the other hand, Cassandra is suited to large sets of structured and semi-structured data.

Combining the two databases provides an effective way to conduct graph-based analysis on large datasets.

This allows you to leverage the strengths of each database for a more comprehensive analysis.

To maximize the efficiency of Cassandra’s query language, the columns and column families should be designed in such a way that they are optimized for reading data.

This may result in duplication of data as you will have to create new tables with the same content, but tailored to suit your queries better.

Now, we know that Neo4j is very adept at handling relationships, so it may be wise to bring some of our event log data into Neo4j to run some fraud detection Cypher queries.

Step 1: Neo4j-Cassandra data import tool

Neo4j-Cassandra data transfer tool allows you to migrate data from Cassandra’s column-oriented model into a Neo4j property graph.

It does this by inspecting the Cassandra schema and letting you define how the data should be mapped.

Make sure that both Cassandra and Neo4j are up and running properly.

Next, clone the connector repository: git clone https://github.com/neo4j-contrib/neo4j-cassandra-connector.git

Install the project dependencies using

pip install -r requirements.txt

Step 2: Populating an initial Cassandra Database

For this example, a sample database of musicians and their songs will be utilized.

To load the Tracks and Artists sample data into your Cassandra database, navigate to the db_gen directory, start the cqlsh shell, and then execute the command SOURCE '/playlist.cql' (use the absolute path if you are not in the same directory). This will populate your database with the sample content.

Step 3: Inspect the Cassandra schema

To correctly establish a connection between Cassandra and the graph, you must populate the initial database and then generate a mapping file. This file is what lets you map the Cassandra schema effectively.

  • Go to the project directory and select the subfolder ‘connector’.
  • Execute the script ‘connector.py’ using Python by running the command ‘python connector.py parse -k playlist’
  • After processing, you will have a schema.yaml file generated. This contains a YAML representation of the Cassandra schema along with details on converting it into Neo4j’s property graph data model format. Edit this file to specify the graph structure you would like to use to import your data.

Step 4: Configure data model mappings

To be able to bring data from Cassandra into Neo4j, the schema.yaml file needs to be populated with details about the mapping between both structures. This will ensure that elements in the Cassandra structure are properly represented in Neo4j’s property graph model.

CREATE TABLE playlist.artists_by_first_letter:
    first_letter text: {}
    artist text: {}
    PRIMARY KEY (first_letter {}, artist {})

CREATE TABLE playlist.track_by_id:
    track_id uuid PRIMARY KEY: {}
    artist text: {}
    genre text: {}
    music_file text: {}
    track text: {}
    track_length_in_seconds int: {}

NEO4J CREDENTIALS (url {}, user {}, password {})

Neo4j uses Nodes to represent each table in a database, allowing for graph-like model navigation and analysis. Every node created in Neo4j will have a label corresponding to the keyspaces from Cassandra.

Each {} placeholder can be filled with one of the following options; a hypothetical filled-in fragment follows the list.

  • p, for regular node property (fill with {p}),
  • r for relationship (fill with {r}),
  • u for unique constraint field (fill with {u})
  • i to create an index on this property (fill with {i})

Neo4j address and credentials are required at the end of this file. If you have disabled authentication, the user and password fields can be left blank.

NEO4J CREDENTIALS (url {"http://localhost:7474/db/data"}, user {"neo4j"}, password {"neo4jpasswd"})

Step 5: Import to Neo4j

Once you have added all the required information to the empty brackets, remember to save the file and then run connector.py. This time, specify which tables you want to export from Cassandra:

python connector.py export -k playlist -t track_by_id,artists_by_first_letter

If you want to change the name of the YAML file from its default ‘schema.yaml’, you can use a CLI argument to specify it. For example:

python connector.py export -k playlist -t track_by_id,artists_by_first_letter -f my_schema_file.yaml

Step 6: Mapping data from Cassandra into Neo4j

The connector converts the YAML mapping into Cypher queries, which are then used to create the Nodes and Relationships in the graph structure. A file with a ‘cypher_’ prefix will appear in your directory, containing these Cypher queries. Once created, Py2neo automatically sends the queries to Neo4j using the parameters specified in the schema.yaml file.
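As a rough, hypothetical sketch (not the connector's literal output), the generated Cypher for the track_by_id mapping above might resemble the following; the relationship type and exact parameter names are assumptions:

// Hypothetical sketch — actual output depends on your schema.yaml choices
CREATE CONSTRAINT ON (t:track_by_id) ASSERT t.track_id IS UNIQUE;

MERGE (t:track_by_id {track_id: $track_id})
SET t.track = $track, t.genre = $genre
MERGE (a:artist {artist: $artist})
MERGE (t)-[:ASSOCIATED_WITH]->(a);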

Importing JSON Data With REST API to Neo4j

Neo4j lets you retrieve data from JSON-based Web APIs and transform it into map values for Cypher’s use. The APOC library’s Load JSON procedures are the main way to import such APIs into Neo4j.

The Load JSON procedures enable you to retrieve data from URLs or maps and convert it into map values, which can be used in Cypher. Cypher provides helpful features like dot syntax, slices, and UNWIND that can make it easier to transform data into useful graphs.

Procedure Overview

The following are the available procedures

apoc.load.json

apoc.load.json('url',path, config) YIELD value

Imports JSON as a stream of values if the JSON is an array, or as a single value if it is a map. This procedure parses JSON files, local or available at an HTTP URL, into a map data structure.

Signature:

apoc.load.json(urlOrKeyOrBinary :: ANY?, path = :: STRING?, config = {} :: MAP?) :: (value :: MAP?)
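A typical usage looks like the following; the Stack Exchange API is used here only as an illustrative public JSON endpoint, and the returned fields reflect that API's response shape:

WITH 'https://api.stackexchange.com/2.2/questions?pagesize=5&order=desc&sort=creation&tagged=neo4j&site=stackoverflow' AS url
CALL apoc.load.json(url) YIELD value
// The API returns a map with an 'items' array; stream one row per question
UNWIND value.items AS item
RETURN item.title, item.owner.display_name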

apoc.load.jsonParams

apoc.load.jsonParams('url',{header:value},payload, config) YIELD value

Load from a JSON URL (e.g. a web API) while sending headers and a payload, importing JSON as a stream of values if the JSON is an array, or a single value if it is a map. It works like apoc.load.json but additionally accepts customizable HTTP headers and a JSON payload, making it more flexible than the other procedures of its kind.

Signature:

apoc.load.jsonParams(urlOrKeyOrBinary :: ANY?, headers :: MAP?, payload :: STRING?, path = :: STRING?, config = {} :: MAP?) :: (value :: MAP?)
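A minimal sketch of sending an authorization header to a hypothetical API endpoint (the URL and token are placeholders):

CALL apoc.load.jsonParams(
  'https://api.example.com/items',        // hypothetical endpoint
  {Authorization: 'Bearer <your-token>'}, // custom HTTP header
  null                                    // no payload, so a GET request
) YIELD value
RETURN value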

apoc.load.jsonArray

apoc.load.jsonArray('url') YIELD value

Load an array from a JSON URL (e.g. a web API) to import JSON as a stream of values. This procedure converts a JSON array stored in a file or at an HTTP URL into a stream of values, one per array element.

Signature:

apoc.load.jsonArray(url :: STRING?, path = :: STRING?, config = {} :: MAP?) :: (value :: ANY?)
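A minimal sketch, assuming a hypothetical URL that serves a top-level JSON array:

// Stream each element of the array as a separate row
CALL apoc.load.jsonArray('https://example.com/data.json') YIELD value
RETURN value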

apoc.import.json

apoc.import.json(file,config)

Imports graph data from the provided JSON file. This procedure is used to import JSON files created with Export JSON and exported with the config parameter jsonFormat set to ‘JSON_LINES’ (the default setting).

Signature:

apoc.import.json(urlOrBinaryFile :: ANY?, config = {} :: MAP?) :: (file :: STRING?, source :: STRING?, format :: STRING?, nodes :: INTEGER?, relationships :: INTEGER?, properties :: INTEGER?, time :: INTEGER?, rows :: INTEGER?, batchSize :: INTEGER?, batches :: INTEGER?, done :: BOOLEAN?, data :: STRING?)
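Assuming a file previously produced by Export JSON, and with file imports enabled (see the setting below), the call is simply:

CALL apoc.import.json('file:///all.json')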

Importing from a file

In apoc.conf, you can enable importing from the file system by setting a specific property. This is disabled by default and needs to be enabled manually.

apoc.import.file.enabled=true

Using JSON paths

JSON paths offer a concise format for extracting and managing sub-documents & values from complex JSON structures. This is an efficient way to read and process data from nested formats.

You can use JSON paths instead of JSON files for Cypher queries, which can help shorten the query length for nested JSON data. This way, you won't have to unwind each object when accessing the required substructures. The JSON path format follows the Java implementation by Jayway of Stefan Gössner’s JSONPath, providing a consistent syntax for the paths.

The apoc.convert.*Json procedures and functions, as well as the apoc.load.json procedure, accept a JSON path as their last argument. These are intended to stream arrays of values or objects; if the path points to a single item or value, the function may not return the expected results.

The apoc.json.path(json, path) function takes a JSON string, rather than a map or list, and fetches values from the JSON path given as the second argument. Should you need to convert a map or list into a JSON string first, you can use the apoc.convert.toJson function.
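A minimal sketch of the function; the JSON literal here is purely illustrative:

RETURN apoc.json.path('{"a": {"b": [1, 2, 3]}}', '$.a.b') AS result
// result: [1, 2, 3]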

Once the file is accessible to the Neo4j instance and you run a query using any of these procedures, you will get back a map that looks almost the same as the JSON document. You can then extend that query to create a graph based on the JSON file.
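For example, assuming a hypothetical JSON array of people, each with a name and a list of friends, the query below creates Person nodes and FRIEND_OF relationships (URL, keys, and names are all placeholders):

// Hypothetical JSON: [{"name": "Alice", "friends": ["Bob", "Carol"]}, ...]
CALL apoc.load.json('https://example.com/people.json') YIELD value
MERGE (p:Person {name: value.name})
WITH p, value
UNWIND value.friends AS friendName
MERGE (f:Person {name: friendName})
MERGE (p)-[:FRIEND_OF]->(f)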

Best Practices for Connecting Neo4j to Data Sources

  • Understand Data Model: Ensure a clear understanding of the data model in Neo4j and the schema/structure of the data sources to design an effective integration strategy.
  • Data Mapping: Map the data from the source to the graph model in Neo4j appropriately, preserving relationships and ensuring data consistency.
  • Incremental Updates: Implement mechanisms to handle incremental updates, syncing changes between Neo4j and data sources efficiently.
  • Security Considerations: Follow security best practices, including securing access credentials, implementing encryption, and adhering to data privacy regulations.
  • Performance Optimization: Optimize queries, indexing, and caching strategies to ensure efficient data retrieval and processing.

Conclusion

Connecting Neo4j to data sources provides a range of benefits, expanding the capabilities of graph-based analysis and enabling comprehensive insights.

By integrating with various data sources, Neo4j allows you to enrich data, analyze relationships, make real-time decisions, and achieve better performance.
