Importing Neo4j Graph Data with Power BI

Bryant Avey
CodeX
Published in
20 min readMay 20, 2021
Neo4j + Power BI (copyrights reserved by Microsoft and Neo4j)

Making the switch from the awkward complexity of a relational database to a more stream-lined data workflow utilizing a graph-powered database system doesn’t mean that you have to perform a complete purge of your entire technology suite. Many currently ubiquitous technologies can be paired together with graph to facilitate an effective data workflow while also allowing you to take advantage of the investments that you’ve already made in your technology stack. This article shows you how to integrate the graph database, Neo4j, with Power BI, a business intelligence tool that may already be a part of your technology stack. According to both Forrester and Gartner, Neo4j and Power BI are two of the leading technologies.

Specifically, this article covers how to use the default version of the Neo4j HTTP API with Power BI. Utilizing the Web API method allows all queries written for the REST API to be utilized in Power BI’s online service. There are other options for connecting Neo4j to Power BI or other BI tools, including Neo4j’s JDBC Java-based BI connector, or Charlotte Skardon’s Neo4jDataConnectorForPowerBi. In addition to these solutions, it’s also easy to get data into Power BI Desktop using R or Python code.

While these solutions work if you want to manually publish reports to share in the Power BI online environment, Power BI online does not allow automatic refreshing from custom data connectors. Power BI online also does not support the R package for Neo4j, nor does it support the Python packages for Neo4j connections. This means that in order to have Power BI online automatically refresh a report from Neo4j data on a scheduled basis, you cannot use R, Python, Custom Connectors, or JDBC/ODBC connectors. The only option that will allow live automated refreshes from Neo4j is the HTTP REST API.

Microsoft is working to allow custom connectors in the Power BI online environment, but it’s likely this will only be available in the premium version. Presently, custom connectors only work in the on-prem and desktop environment. Anything developed in Power BI Desktop can be published online and manually refreshed in Power BI Desktop and republished online anytime, but it cannot auto refresh without a Microsoft supported connector in the Power BI Online service. This is also the case for both the R and Python packages for Neo4j. R and Python packages for Neo4j can be used in Power BI Desktop and published online, but the online service cannot refresh them. This means you can manually refresh and manually republish Neo4j package-based reports using R and Python, but the refreshing cannot be automated and scheduled.

Before we begin, there are a couple of technical complications with Neo4j’s HTTP API and Power BI that need to be understood and addressed:

1. Neo4j’s HTTP API is a REST, POST-based API. By default, Power BI REST API connectors expect to use GET-based REST calls. Because Neo4j utilizes a POST-based REST API, some custom Power Query coding must be done in Power BI utilizing Power Query Script.

2. Neo4j utilizes web basic authorization or basic auth to authenticate the connection to the Neo4j graph database. This involves creating a 64-bit encoded text string of the database username and password. The code presented in this article does the encoding within the Power Query Script.

3. Power BI requires anonymous connections to URLs, but the rest call can contain authentication in the header of the web call. Some special setup and custom coding of Power Query Script is needed to accommodate this.

4. Neo4j’s HTTP API is not a standardized REST API, meaning you get different JSON structures back depending on what type of Cypher query you submit to the Neo4j database. For this reason, we will need separate query templates for querying Nodes versus querying Relationships versus querying Properties because the resulting JSON structure is completely different coming from Neo4j for each of these types of Cypher queries.

This walk-through will use a local Neo4j database running in Neo4j Desktop Enterprise Edition so that you can see how to connect to a specific database in a multi-graph database environment. For data, we’ll use a super quick and easy-to-create data set that uses a blog feed and blog categories. For this example, I’m using the StarWars.com blog feed. To quickly get the blog data into Neo4j, just use the following Cypher query utilizing the APOC library. (Note: You will need to be sure to install the APOC library in your Neo4j database for this to work.)

Running this Cypher query in your Neo4j database will produce a simple graph containing the blog entries and the corresponding categories of each post:

You can point the URL at any blog or RSS feed. Feel free to change and tweak the query to pull in other information, or to pull data from other blogs. Note that I added a property called ‘a.blog = “StarWars.com,” so if you want to pull in multiple blogs and combine them into a single graph database, just run the query multiple times against multiple blog feed URLS, changing the blog property to match the name of the blog you’re pulling in. That way you can use the blog name to filter nodes by blog name later, if desired.

To make sure the graph db has the data properly working, just run this cypher query in Neo4j:

match p=(a:Article)-[r]-(c) return p

You should see a nice graph returned that shows the categories for each blog:

With that done, we’re now ready to pull the data into Power BI for visualization.

A quick note: If all you wanted to do was pull blog data into Power BI, you could do that without using Neo4j. But in this case, the assumption is that we want to run some graph data science algorithms in Neo4j to utilize relationship information not available in relational databases.

Also, there will be instances where the data in Neo4j does not have any foreign keys to make relationships between tables when pulling Neo4j data into Power BI. This article explores Power BI Query options for dealing with that scenario, since foreign keys are not needed in a graph database to create relationships.

GETTING THE DATA

Let’s start by pulling data into Neo4j using the most common and simple type of Cypher query: getting all the Article nodes. The cypher query we’ll use for this is: “MATCH (a:Article) RETURN a”. This will give us all the node entities with a label of “Article”.

To set this up in Power BI, we’ll use the Power BI Web Connector. The Web Connector is easy and effective because all you do to get the data is connect to a URL. Power BI will automatically figure out what type of additional data connectors are needed to pull the data in from the URL. If a URL points to an Excel file, Power BI will automatically wrap an Excel connector around the Web Connector for you. This is also true with XML or JSON.

I stated earlier that Power BI utilizes a GET request when pulling in REST or Web data from URLs. Neo4j requires a POST request to utilize the Neo4j HTTP API. So, we have to do more than connect to the Neo4j URL to make the connection and get the data. I’ve created a Power Query template that will allow us to easily provide the parameters needed for Neo4j to pull the data into Power BI.

To use the query template, launch Power Query from Power BI by clicking the “Transform Data” button.

In Power Query, from the Home ribbon, click the dropdown on the New Source icon and choose “Blank Query.”

To edit the query, click the “Advanced Editor” icon from the Home ribbon.

Highlight all the text in the blank query and delete it so the query is completely empty.

Paste in the following query:

A few notes about the Power Query Script:

  • This Power Query may seem complicated to read because it combines several steps into a single step to make it easier for users to enter the parameters and not see all the JSON navigation steps. Power Query Script is a functional language, allowing you to wrap functions within functions. Each Step in Power BI is essentially a function, so it can be combined by wrapping a subsequent step around the previous step to combine those steps into a single step. The core of this Power Query Script is a Web.Contents() connector function with several other JSON navigation functions wrapped around it.
  • The Power Query scripts utilize Neo4j’s default data format and not the new Jolt formats.
  • Although it’s not explicitly stated in the query, the Web.Contents() function is doing an HTTP POST call. Power BI automatically performs a post when certain headers are used in the HTTP call. In the case of this query, we pass in Authorization parameters in the HTTP header along with the Cypher query in the body of the web call. Power BI internally sends this as a POST call to the server, even though POST or GET is not explicitly stated in the query.
  • The Basic Authentication requirement of Neo4j is being handled in the portion of the Power Query script that says: “#”Authorization”=”Basic “ & Binary.ToText(Text.ToBinary(__Username & “:” & __Password, null, null))”. Later I’ll show you how to supply the database username and password in the query steps. This bit of code automatically creates the base 64, basic auth token and passes it to Neo4j in the header of the HTTP API call. If you have questions about this, or how it works, feel free to leave a comment.
    This article is an extension of the session I did at the Neo4j NODES 2020 conference Automating the Neo4j Pipeline for Data Shaping and Visualization with Power BI. If you watched the presentation at the NODES 2020 conference, you’ll see how I manually created the Basic Auth token using a Python Script or using the Neo4j HTTP call to get the Basic Auth token for use in passing the credentials.
    This method is an updated version and does everything right in the Power Query script, not needing any external processing to create the base 64 encoding.

After pasting the code in, your query should look something like this:

Click the Done button when finished.

The first time Power BI connects to a new data source it will prompt you to determine how to connect to that data source. When connecting to Neo4j, it’s important that you choose the Anonymous connection. The actual username and password are used in the query, but the web connection to Neo4j needs to be anonymous. If your Neo4j Server requires a Windows, or other credentials, you would specify your credentials to connect to the server — not the Neo4j Database, but to the server itself. If you’re using Neo4j locally on your desktop, choose the Anonymous connection for it to work.

Rename Query1 to “Articles”. When finished you should have a query called Articles, with 6 Applied Steps.

It’s normal for your query to display an error. We need to set the URL, Database Name, User, and Password to make a proper connection to the Neo4j database. To do this, click on each of the applied steps and set the appropriate values.

If you’re not sure what each step does, you can hover the mouse over the information icon on the right of each step to see a description for what to enter at each step. All the steps that require entry parameters, are indicated with double underscores in front of the step name.

Starting at the first applied step in the list, click on the __Neo4jURL step and type in the URL of your neo4j server. If you’re running Neo4j locally on your desktop using the default install values, then your Neo4j server URL is http://localhost:7474. If your server is setup differently or uses a different port number just change the URL listed.

Next, click on the __Neo4jDatabase step and type in the name of your database. I created a new database in Neo4j and called the database in Neo4j “blogs”. If you loaded your data to the default Neo4j database, just type in neo4j or whatever database you want to connect to.

Next, select the __Username step and type in the username for accessing your Neo4j database.

Now, select the __Password step and enter the Neo4j password for the user you specified.

Finally, enter your cypher query in the __CypherQuery applied step. This query template only returns node objects. So, type in a Cypher query that will return only nodes. The first thing we want to get from Neo4j are all the Articles. The Cypher query MATCH (n:Article) RETURN n will give us all the nodes in Neo4j with a label of Article.

Now, when you click on the very last step in applied steps, “Ready to Expand Rows,” you should see a single column called “row” with a list of “Record” rows.

At this point, you have a reusable query template that is successfully connected to Neo4j. This base query will be the foundation for multiple Neo4j Power Queries going forward. As new types of Cypher queries run, which return different types of JSON structures, we will create new queries to handle the various JSON results, hierarchy trees.

Since we will be reusing this base query, it’s a good idea to make a duplicate copy of it and rename it as “Neo4jNodeTemplate”. Once the query is duplicated, you can disable it, so it won’t load. To do this, right click the query and uncheck the “Enable load” checkbox. This way you can duplicate this query for the other types of Cypher queries that we’ll make later, and you won’t have to deal with deleting any unused and unneeded query steps.

Now we’re ready to go back and finish the Article query. Since we started with an easy query that just returns nodes, we have only one additional step to add to get the node properties. Select the Article query and then click the column expansion icon at the top-right of the column called “row.”

A drop-down dialogue window will show up containing a list of the columns in each record. Be sure to uncheck the “Use original column name as prefix” at the bottom of the dialogue box. If you leave it checked, you’ll end up with column names that have the original column, which is currently called “row” followed by a period and then the new column name. So, you would end up with row.id, row.blog, row.body, etc. To avoid this, uncheck the option to use the original column name as a prefix. When done, click the OK button.

You will see that a new applied step has been added called “Expand row”, and you will now see the 4 columns with the Article data from the Article nodes from Neo4j.

Pulling in other nodes is just as easy. Duplicate the Neo4jNodeTemplate Query by right clicking it and choosing duplicate, just like above. Call the new query “Categories”.

In the Categories, query select the __CypherQuery step and change the Cypher statement to pull in the category nodes: “MATCH (n:Category) RETURN n.”

Next, select the last step “Ready To Expand Rows”, click the column expansion icon at the top-right of the column called “row”. Expand the row column to include the columns and hit OK.

Finally, since our Neo4jNodeTemplate query is not enabled, right click the Categories query and enable it to load. When you do this, Power BI will ask you how you want to connect. Since Neo4j does not allow DirectQuery mode (this is reserved for SQL Server and other relational databases), choose “Import” mode.

You will now have a table of categories.

Next, we need to get the relationships between the Article and Category so that we can connect them in Power BI. This can be accomplished in much the same way as getting the Article and Category nodes; however, Neo4j has a different JSON result tree structure when returning relationships than it does for returning nodes. To process the relationships, we’ll have to go through a few more steps to get the data.

Additionally, the HTTP API returns different results when running the same Cypher Query in the Neo4j Browser. Let’s take a look at the difference to fully understand the issue and to determine the best way to get the relationship data from Neo4j.

To use graph relationships from Neo4j in Power BI, the data must first be converted to what’s called an associative table. This associative table ties together the Category and Article tables that were already created in the previous process. To create the associative table in Power BI, we need the starting and ending node id from Neo4j so that we know which Article nodes are connected to which Category nodes.

When we run the Cypher query “MATCH ()-[r:IN_CATEGORY]-() RETURN r” in the Neo4j Browser to get the relationship data, we seem to get what we need.

JSON Structure returned from the Neo4j Browser

The Neo4j Browser displays an array of properties with the relationship identity, the start node id, end node id, the type, and the relationship properties. This looks good. But wait… There is a catch. As I mentioned previously, the Neo4j HTTP API does not return the same JSON structure as the Neo4j Browser.

JSON structure from Neo4j’s HTTP API

While the Neo4j Browser returns all the information you need, The API does not. It eliminates the start and end node id, the two things you need for the associative table. The Neo4j Browser performs a lot of extra steps behind the scenes, one of those extra steps is to run queries that provide additional information that is not provided directly from the Cypher Query.

It’s up to the data analyst to write a Cypher query that will return all the relationship data needed for Power BI to make the connection between tables, simulating the Neo4j relationships. The following Cypher Query will provide the needed data and this query will work in the Neo4j Browser as well as in the HTTP API.

MATCH (a)-[r:IN_CATEGORY]-(c) RETURN id(r) as RelationshipId, id(a) as ArticleNodeId, id(c) as CategoryNodeId, type(r) as RelationshipName

The Cypher Query has to be written so that it returns the desired properties of both the relationships and the nodes. This is called a Cypher Property query as it returns only properties. The earlier queries we used for Articles and Categories were object queries, as they returned node objects. Our first attempt to get the relationships was also an object query returning a relationship object. It’s important to always test your queries in the Web API and not just rely on what you see in the Neo4j Browser.

(Note: I use Postman to test out the HTTP API calls and look at the resulting data)

Neo4j Query results

The HTTP API resulting JSON looks like this:

JSON results from the Neo4j HTTP API

The Neo4j API returns a JSON “results” set of “columns” with the column names, and then it returns the “data” in the “row” section of the data group within the results block. This is similar to the JSON that is returned when querying and returning nodes, but it’s a slightly different structure, so we need a different Power Query Script to transform Cypher property queries.

One of the differences between returned browser results and returned API results is that column names are separated into a different section of the JSON result hierarchy. The following Power Query Template script will allow Power Query to get both the column names and the data from the JSON results that get returned from Neo4j. This 2nd Power Query script template will handle any Cypher query that uses properties, as opposed to using node or relationship objects as return values.

Neo4j result sets are not standardized so it’s important to write consistent queries when using the HTTP API. You would not want to mix using RETURN statements that return both objects and properties. For example, a statement of: RETURN n.property, n. This is because the resulting JSON will put the values from n in a separate JSON hierarchy tree from the n.property values. The 1st query, Neo4jNodesTemplate, returns objects. This next query will return only named property values.

(Note: It’s probably possible to make one universal Power Query Script that would work with most or multiple types of Neo4j Cypher return result sets, but for this article we’ll keep the two types of queries separate.)

(Tip: When writing Cypher queries for use in the HTTP API, use all named properties or only return objects.)

Copy and paste the following Script into a new blank Power Query and name it “Neo4jPropertyTemplate”.

It should be noted that the first part of the query is identical to the Neo4jNodeTemplate. Where it differs is that we have to return both the column section of the JSON tree and the data section of the tree. To do this, it was necessary to create a variable in the query called “DataWithColumnHeaders”. This variable preserves both the [column] and [data] JSON tree structures in the Neo4jResults variable. The [column] JSON tree gets used near the bottom of the query where it reads: #“Neo4jResults”[columns]. The query pulls the JSON structure in and processes the data, then it separately extracts out the column data and inserts a row with the data at the top of the resulting table. With the column names row inserted at the top it uses a function to promote the 1st row of data as the column header.

This is a query that provides an example of “Pro-Code” vs “No-code.” Several functions were combined to minimize query steps in the template, hopefully making it easier for Power BI report writers to reuse these queries by simply filling in the double underscore or “dunder” steps to get the cypher query to run.

With the Neo4jPropertyTemplate query written, you can duplicate the query and call the new query IN_CATEGORY. This query will give you the Relationship ID along with the starting and ending nodes and the name of the relationship. When finished, you should have something that looks like this:

Both query templates should be set to disabled and the Articles, Categories, and IN_CATEGORY should be set to enable load.

If you click the “Close & Apply” button in the home ribbon, the data will be loaded into Power BI for use.

But wait… We have a problem!

Notice that Node-based queries don’t have the node id. There is no way to really connect the Article to the Category. The data from relationships is available, but the node result set from the Neo4j HTTP API doesn’t provide the full set of data.

To get the full set of data, the 2nd query template will have to be run to recreate the Articles and Categories queries using the named object properties to get the node id to make the connections. To do this, right click the Neo4jPropertyTemplate query and duplicate it, renaming it to Article. Do the same thing for Category. To return the node properties for the 2 queries, use the Cypher object.property notation and the function id() to get the node id.

For the Article query use the Cypher query:

MATCH (n:Article) RETURN id(n) as ArticleNodeId, n.id as Id, n.blog as Blog, n.body as Body, n.title as Title

For the Category query use:

MATCH (n:Category) RETURN id(n) as CategoryNodeId, n.name as Name

When the query is complete, you will have an Article and Category query with the node id. You can remove or disable the Articles and Categories queries and enable the Article and Category query to load.

Now when you choose close and apply, Power Bi will automatically create the relationships in the model between the Category, IN_CATEGORY, and Article tables.

With everything modeled and setup, you can begin visualizing the data. Power BI automatically uses the relationships to apply filters to the data. If you drag out the Category name as a table or slicer, then drag out the blog title and body to another table, you can select a category and see all the blog entries for that category, etc.

An example of category slicers filtering blog titles and body text in Power BI

Some final thoughts: Depending on the data you want to use it may be easier to use the Node Template. If the nodes in Neo4j have property keys, the Node Template can be used very effectively to create relationships between the Associative Relationship Tables and Node Tables, or directly between nodes using the node properties as foreign keys.

I tend to use the Property Template. It makes it necessary to spell out all the node and relationship properties I want when writing Cypher code, but it also allows for the creation of specific relationships, and column names for visualization and modeling. Power BI will automatically build relationships between tables where the two tables have qualifying foreign key column names and one-to-many cardinalities. Sometimes it’s easier to do things in Cypher, and sometimes it’s easier to do them in Power BI DAX or Power Query. The Property Template gives you that flexibility.

You can download hundreds of visualizations for Power BI, including things like traditional force graphs, to show graph-style visuals in addition to various other types of visuals.

An example of a Power BI Network Navigator visual
An example of a Power BI force graph visual

Another really good way to explore graph data is through a Decomposition Tree. Just load up the nodes and properties in the Decomp Tree visual and start drilling down to discover how your data stacks up.

An example of a Power BI Decomposition Tree visual

In addition to the thousands of visuals available in the Power BI AppSource, Python and R visuals can also be used directly in Power BI using the R and Python visuals.

Default Power BI Visualizations highlighting the Python visual

If you have the On-Premises Data Gateway installed for your Office 365 environment, once you publish your Power BI reports to the Power BI online service, you can schedule automatic refreshes of your Neo4j graph data and provide dashboards, drill-in reports, and everything else Power BI offers.

Power BI also allows you to integrate other data sources with your Neo4j data. So, if your report writers don’t have Neo4j Graph skills, they can still utilize Neo4j data for analysis and reporting.

Let me know if you have questions or suggestions. I’d love to hear how you use this in your environment.

--

--

Bryant Avey
CodeX
Writer for

Bryant is a Solution Architect at InterNuntius, a data integration and analytics consultancy. He is also the Chief Data Officer for Stratalytica, a nonprofit.