Use Python to create network graphs in Tableau

Daan Tor
Daan Tor
Feb 25, 2017 · 6 min read

Network graphs are very useful visualisations to look for patterns in large sets of data. In this post the goal is to create a network graph in Tableau. There are better plotting tools than Tableau for this, but we are going to do it anyway. Because we can. That’s why.

A network of U.S. senators and their social networks.

Now why would you use Python instead of NodeXL or Gephi for the data processing? Because Python is easier to run in production environments than dedicated tools. Python is more flexible and can be scheduled easily and can therefore be included in ETL processes :)

Our case

For this post we are going to pretend to be a hospital Business Analyst. The hospital wants to know if customer life cycles can be optimised. The main question that the hospital wants answered is the following: Are our clients getting all their possible treatments from us?

For this case we (yes, you and I ;) ) decide that the following hypotheses should (at least partly) answer the aforementioned question.

  • All clients that are buying product X have bought product Y at one point before
  • Product A is an ‘entry-product’ for the network of product-group X
  • No one buys anything anymore after product B
  • A client enters our network of products via product Q.

Approach

Our goal is to go from a fact-based table of transactions to a network diagram in Tableau. For this we will do the following:

  • Determine what connections products have in SQL, and in what order (because our data is in SQL Server).

After this we should have 2 csv files. An edge list (with weight) like so:

[table id=1 /]

And a node-list:

[table id=2 /]

If you already have these you may skip the first part.

  • Plot a graph in Python
  • Enrich and shape the graph in SQL for usage in Tableau

Disclaimer: We have a ‘few’ popular products in this case. The results for (for example) a retailer with thousands of products would be sub-optimal because there would be a mess of a network. There are various ways to solve this, but that’s not in the scope of this post.

SQL Part I: Forming facts into a network

For the data preprocessing we need to figure out which products were sold to who, which order that happened in, and reshape this data into 2 lists: 1 list of nodes, and 1 list of edges (lines between nodes).

We start by reducing the Fact_Transaction set to a table of the Client_ID, order in which the client was buying. This is inserted into the temporary table #Client_Date for later use. (FYI: Transaction_Order means the sorted order of transactions, it is not an “order” that a client placed)

SELECT  Fact_Transaction.Client_ID
, Fact_Transaction.Date AS First_Transaction_Date
, ROW_NUMBER() OVER ( PARTITION BY Client_ID ORDER BY Fact_Transaction.Date )
- 1 AS Transaction_Order
INTO #Client_Date
FROM Fact_Transaction
WHERE Fact_Transaction.Client_ID IS NOT NULL
AND Fact_Transaction.Product_ID IS NOT NULL
GROUP BY Fact_Transaction.Client_ID
, Fact_Transaction.Date
ORDER BY Fact_Transaction.Client_ID
, Fact_Transaction.Date

Determine the first time a product was bought by a client.

SELECT Fact_Transaction.Client_ID
, Fact_Transaction.Product_ID
, MIN(Fact_Transaction.Date) AS First_Transaction_Date
INTO #Client_Date_product
FROM Fact_Transaction
WHERE Fact_Transaction.Client_ID IS NOT NULL
AND Fact_Transaction.Product_ID IS NOT NULL
GROUP BY Fact_Transaction.Client_ID
, Fact_Transaction.Product_ID
ORDER BY Fact_Transaction.Client_ID
, MIN(Fact_Transaction.Date)
, Fact_Transaction.Product_ID

Join the tables to determine the order in which the client bought items.

SELECT #Client_Date_product.Client_ID
, #Client_Date.First_Transaction_Date
, ROW_NUMBER() OVER ( PARTITION BY #Client_Date_product.Client_ID
ORDER BY #Client_Date.Transaction_Order ) AS Transaction_Order
INTO #Transaction_Order
FROM #Client_Date_product
LEFT OUTER JOIN #Client_Date ON
#Client_Date_product.Client_ID = #Client_Date.Client_ID
AND #Client_Date_product.First_Transaction_Date =
#Client_Date.First_Transaction_Date
GROUP BY #Client_Date_product.Client_ID
, #Client_Date.Transaction_Order
, #Client_Date.First_Transaction_Date
ORDER BY #Client_Date_product.Client_ID
, #Client_Date.Transaction_Order

Join the tables to make a pretty and readable version with Client, Product, Date and Order.

SELECT #Client_Date_product.Client_ID
, #Client_Date_product.Product_ID
, #Client_Date_product.First_Transaction_Date
, #Transaction_Order.Transaction_Order
INTO #Root_Table
FROM #Client_Date_product
INNER JOIN #Transaction_Order ON #Client_Date_product.Client_ID
= #Transaction_Order.Client_ID
AND #Client_Date_product.First_Transaction_Date
= #Transaction_Order.First_Transaction_Date
ORDER BY #Client_Date_product.Client_ID
, #Client_Date_product.First_Transaction_Date
, #Client_Date_product.Product_ID

Join the table with itsself to shape each row into a connection with a start and end of connection.

SELECT START.Client_ID AS START_Client_ID
, START.Product_ID AS START_Product_ID
, END.Client_ID AS END_Client_ID
, END.Product_ID AS END_Product_ID
INTO #Flows
FROM #Root_Table START
LEFT OUTER JOIN #Root_Table END ON START.Client_ID
= END.Client_ID
AND START.Transaction_Order + 1 = END.Transaction_Order
WHERE END.Product_ID IS NOT NULL
ORDER BY START.Client_ID
, START.First_Transaction_Date

Aggregate the table to determine the “weight” for each unique connection. Also, make a list of product ID’s.

SELECT #Flows.START_Product_ID AS Source
, #Flows.END_Product_ID AS Target
, COUNT(*) AS Weight
INTO #edges
FROM #Flows
GROUP BY #Flows.START_Product_ID
, #Flows.END_Product_ID


SELECT Dim_Products.Product_ID AS ID
INTO #nodes
FROM Dim_Products

These 2 tables (#nodes and #edges) should be copied and pasted into ‘edges.csv’ and ‘nodes.csv’ files for our next step! Makes sure it is all in the same folder.

Python and networkx: Transforming your network data to a network plot

SQL Part II: Enriching coordinates data

Now I’ve loaded the result csv into SQL Server as ‘Coordinate_Table’. And the first step is to run much of the same script again. This is the same script as above, but with the addition of the node-coordinates found in the csv.

SELECT Fact_Transaction.Client_ID
, Fact_Transaction.Date AS First_Transaction_Date
, ROW_NUMBER() OVER ( PARTITION BY Client_ID ORDER BY
Fact_Transaction.Date )
- 1 AS Transaction_Order
INTO #Client_Date
FROM Fact_Transaction
WHERE Fact_Transaction.Client_ID IS NOT NULL
AND Fact_Transaction.Product_ID IS NOT NULL
GROUP BY Fact_Transaction.Client_ID
, Fact_Transaction.Date
ORDER BY Fact_Transaction.Client_ID
, Fact_Transaction.Date


SELECT Fact_Transaction.Client_ID
, Fact_Transaction.Product_ID
, MIN(Fact_Transaction.Date) AS First_Transaction_Date
INTO #Client_Date_product
FROM Fact_Transaction
WHERE Fact_Transaction.Client_ID IS NOT NULL
AND Fact_Transaction.Product_ID IS NOT NULL
GROUP BY Fact_Transaction.Client_ID
, Fact_Transaction.Product_ID
ORDER BY Fact_Transaction.Client_ID
, MIN(Fact_Transaction.Date)
, Fact_Transaction.Product_ID

SELECT #Client_Date_product.Client_ID
, #Client_Date.First_Transaction_Date
, ROW_NUMBER() OVER ( PARTITION BY #Client_Date_product.Client_ID
ORDER BY #Client_Date.Transaction_Order ) AS Transaction_Order
INTO #Transaction_Order
FROM #Client_Date_product
LEFT OUTER JOIN #Client_Date ON #Client_Date_product.Client_ID
= #Client_Date.Client_ID
AND #Client_Date_product.First_Transaction_Date
= #Client_Date.First_Transaction_Date
GROUP BY #Client_Date_product.Client_ID
, #Client_Date.Transaction_Order
, #Client_Date.First_Transaction_Date
ORDER BY #Client_Date_product.Client_ID
, #Client_Date.Transaction_Order

SELECT #Client_Date_product.Client_ID
, #Client_Date_product.Product_ID
, #Client_Date_product.First_Transaction_Date
, #Transaction_Order.Transaction_Order
INTO #Root_Table
FROM #Client_Date_product
INNER JOIN #Transaction_Order ON #Client_Date_product.Client_ID
= #Transaction_Order.Client_ID
AND #Client_Date_product.First_Transaction_Date
= #Transaction_Order.First_Transaction_Date
ORDER BY #Client_Date_product.Client_ID
, #Client_Date_product.First_Transaction_Date
, #Client_Date_product.Product_ID

SELECT START.Client_ID AS START_Client_ID
, START.Product_ID AS START_Product_ID
, END.Client_ID AS END_Client_ID
, END.Product_ID AS END_Product_ID
, END.First_Transaction_Date
, START.Transaction_Order
INTO #Flows
FROM #Root_Table START
LEFT OUTER JOIN #Root_Table END ON START.Client_ID
= END.Client_ID
AND START.Transaction_Order + 1 = END.Transaction_Order
WHERE END.Product_ID IS NOT NULL
ORDER BY START.Client_ID
, START.First_Transaction_Date

SELECT #Flows.START_Client_ID AS Client_ID
, #Flows.START_Product_ID
, #Flows.END_Product_ID
, #Flows.First_Transaction_Date
, Coordinate_Table.Product_ID
, Coordinate_Table.X AS Line_X
, Coordinate_Table.Y AS Line_Y
, Coordinate_Table.Y AS Circle_Y
, #Flows.Transaction_Order
INTO #FLOW_coords
FROM #Flows
INNER JOIN Coordinate_Table ON #Flows.START_Product_ID
= Coordinate_Table.Product_ID
UNION ALL
SELECT #Flows.START_Client_ID
, #Flows.START_Product_ID
, #Flows.END_Product_ID
, #Flows.First_Transaction_Date
, Coordinate_Table.Product_ID
, Coordinate_Table.X AS Line_X
, Coordinate_Table.Y AS Line_Y
, Coordinate_Table.Y AS Circle_Y
, #Flows.Transaction_Order
FROM #Flows
INNER JOIN Coordinate_Table ON #Flows.END_Product_ID
= Coordinate_Table.Product_ID

Technically, you could already draw your network in Tableau, but I think the data would be a lot more meaningful if it were enriched and stored with the known dimensions.

DROP TABLE Product_Network
SELECT #FLOW_coords.START_Product_ID
, #FLOW_coords.END_Product_ID
, #FLOW_coords.Product_ID
, NODE.Product AS Node_Product
, NODE.Type_Product AS Node_Type_Product
, NODE.Costing_Price AS Node_Costing_Price
, #FLOW_coords.Line_X
, #FLOW_coords.Line_Y
, #FLOW_coords.Circle_Y
, #FLOW_coords.First_Transaction_Date
, #FLOW_coords.Transaction_Order
, START.Product AS START_Product
, START.Type_Product AS START_Type_Product
, START.Costing_Price AS START_Costing_Price
, END.Product AS END_Product
, END.Type_Product AS END_Type_Product
, END.Costing_Price AS END_Costing_Price
, Dim_Clients.Client_ID
, Dim_Clients.Name
, Dim_Clients.Gender
, DATEDIFF(DAY, Dim_Clients.First_Contact
#FLOW_coords.First_Transaction_Date) AS Days_Since_First_Contact
, DATEDIFF(YEAR, Dim_Clients.BirthDate,
#FLOW_coords.First_Transaction_Date) AS Age
, Dim_Clients.City
, Dim_Clients.Nationality
, Dim_Clients.Insurer
INTO Product_Network
FROM #FLOW_coords
INNER JOIN Dim_Products START ON #FLOW_coords.START_Product_ID
= START.Product_ID
INNER JOIN Dim_Products END ON #FLOW_coords.END_Product_ID
= END.Product_ID
INNER JOIN Dim_Products NODE ON #FLOW_coords.Product_ID
= NODE.Product_ID
INNER JOIN Dim_Clients ON #FLOW_coords.Client_ID
= Dim_Clients.Client_ID
ORDER BY 2
, 3
, 1

Tableau

The data has now been transformed into something readable for Tableau. To actually make it into a network I would recommend you follow one of these tutorials:

Clearly and Simply

Tableau Public (Florian Ramseger)

They are great and both have different results!

Future cool stuff Tableau should support

  • Just network graphs in general, without all the hassle in other languages
  • Directed network graphs, to show paths in data
  • Native non-straight lines, to be able to cluster edges and make things prettier in various ways, like this for example:
http://people.ischool.berkeley.edu/~sonali.sharma/projects.html
http://people.ischool.berkeley.edu/~sonali.sharma/projects.html

Thank you for reading! Leave a comment if you have any questions.

Download the project here.

Special thanks to:

Konstantin Greger , Diederick Koetsier and Florian Ramseger


Originally published at Daan Tor.

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade