An Intro to Graph Theory and Analysis using Tidygraph

Learn the basics of structuring and analyzing graph data in R from an experienced data scientist.

Joshua Limaye
Slalom Technology
9 min readMay 18, 2022

--

Photo by Conny Schneider on Unsplash

Data is synonymous with everything we do, and how that data is represented for analysis often takes on many forms. Most are familiar with the conventional representations of data. Tabular data includes a row for each observation and a column for each feature. Pixel values in a photo can help deduce its content. Vibrations in our voice help convert speech into text.

Another structured — and powerful — way of representing data is through a graph, opening the door to a library of highly-efficient algorithms we can use for analysis. This article covers the basics of graph theory, showing you several examples of how we can leverage it to perform graph analysis in R using the tidygraph package.

Graph theory

Formally, graph theory is the study of mathematical structures used to model pairwise relationships between objects. These pairwise relationships constitute a graph. While technical and concise, this definition may leave you wanting, so let me offer my own explanation.

Graphs are composed of two elements: nodes and edges. You can think of nodes as data nouns. If a node shares a relationship with another node, they’re connected by an edge. Think of edges as data adjectives that describe the relationship between nodes.

For example, a social network can be represented as a graph. Nodes represent individuals, and edges represent friendships between individuals. We can see below that Zoe and Tom are friends. In other words, we’ve represented the pairwise friendship between Tom and Zoe as a graph.

Social Network Graph

Tidygraph

Tidygraph is an R package developed by Thomas Lin Pedersen that allows us to represent data as a graph under the tidy framework. This gives us the freedom to interact with our graph using the popular dplyr verbs that are typically used to manipulate data frames. The mechanics of tidygraph leverage igraph — a popular network analysis library we can use to analyze our graph data.

To illustrate the fundamentals, let’s use an example wherein we want to model the relationships in a supply chain as a graph. First, let’s load the necessary libraries then create three data frames mapping the relationships between suppliers, producers, distributors, and customers.

#install packages if needed!!!
#install.packages('tidyverse')
#install.packages('tidygraph')
library(tidyverse)
library(tidygraph)
supplier_producer <- data.frame(
supplier = c('Supp. A', 'Supp. A', 'Supp. B',
'Supp. B', 'Supp. C', 'Supp. D'),
producer = c('Manuf. A', 'Manuf. B', 'Manuf. A',
'Manuf. C', 'Manuf. B', 'Manuf. D')
)
producer_distributor <- data.frame(
producer = c('Manuf. A', 'Manuf. A', 'Manuf. B',
'Manuf. B', 'Manuf. C', 'Manuf. D'),
distributor = c('Dist. A', 'Dist. B', 'Dist. B',
'Dist. C', 'Dist. B', 'Dist. C')
)
distributor_customer <- data.frame(
distributor = c('Dist. A', 'Dist. A', 'Dist. B',
'Dist. C', 'Dist. C', 'Dist. C'),
customer = c('Cust. A', 'Cust. B', 'Cust. C',
'Cust. C', 'Cust. D', 'Cust. E')
)

Next, let’s consolidate our three data frames into a single data frame. Notice that we rename our columns to use a “from-to” naming convention. This isn't required, but it’s a good idea to be explicit about the direction of the relationships between nodes (if there is one). Then let’s cast our data frame to a tidygraph graph object using the as_tbl_graph() function.

network_df <- rbind(
supplier_producer %>%
rename(from = supplier, to = producer),
producer_distributor %>%
rename(from = producer, to = distributor),
distributor_customer %>%
rename(from = distributor, to = customer)
)
network_grph <- as_tbl_graph(network_df)network_grph

The result is a single object composed of two data frames: a node data frame and an edge data frame. The node data frame contains the unique set of node names while the edge data frame contains the relationships between nodes. Notice that relationships are encoded with numeric IDs instead of node names. The numeric IDs correspond to the row numbers in the node data frame.

Graph Object Output

Because tidygraph graph objects support a tidy framework, we can easily apply dplyr operations to either of the data frames. Let’s add the numeric IDs to our node data frame.

network_grph <- network_grph %>%
mutate(id = row_number())
network_grph
Node Data Frame Mutated Output

Because we’re able to interact with each data frame independently, tidygraph includes an activate() function we can use to select the data frame we want to operate on. By default, the node data frame is active. Let’s activate the edge data frame and add a column that we’ll pretend is the geographical distance between suppliers, producers, distributors, and customers.

network_grph <- network_grph %>%
activate(edges) %>%
mutate(distance = from + to)
network_grph
Edge Data Frame Mutated Output

Finally, let’s visualize our graph as a Sankey network using the networkD3 package. To use the sankeyNetwork() function, we need to prepare our data in the following ways:

  • Make this a function that accepts a graph and color argument so we can create similar visuals in the future.
  • Activate and save the node and edge data frames as their own objects because they will be passed as independent arguments.
  • Rename the from and to columns in the edge data frame to “source” and “target” respectively.
  • Subtract one from the “from” and “to” columns in the edge data frame because the numeric ID’s must be indexed at zero, but the as_tbl_graph() function indexes numeric IDs starting at one by default.
  • Select the “source”, “target”, and “distance” columns out from the edge data frame. The “distance” column will be used to define the edge width of our Sankey.
#install packages if needed!!!
#install.packages('networkD3')
library(networkD3)create_sankey <- function(grph_object, colors){nodes <- grph_object %>%
activate(nodes) %>%
data.frame()

edges <- grph_object %>%
activate(edges) %>%
mutate(source = from - 1,
target = to - 1) %>%
data.frame() %>%
select(source, target, distance)

network_p <- sankeyNetwork(Links = edges, Nodes = nodes,
Source = 'source', Target = 'target',
Value = 'distance', NodeID = 'name',
fontSize = 14, nodeWidth = 20,
sinksRight = FALSE,
colourScale = colors
)
print(network_p)

}
#please put the colourScale declaration on a single line!!!
colourScale ='d3.scaleOrdinal() .range(["#FDE725FF","#B4DE2CFF","#6DCD59FF",
"#35B779FF","#1F9E89FF","#26828EFF",
"#31688EFF","#3E4A89FF","#482878FF","#440154FF"])'
p <- create_sankey(grph_object = network_grph, colors = colourScale)
p

While it requires a lot of additional manipulation to create, calling our function returns an attractive visual of our supply chain.

Supply Chain Sankey

Graph analysis

Armed with the basics of tidygraph, let’s act out two scenarios a supply chain manager might encounter in the real world.

Scenario #1

In the first scenario, imagine that we’ve discovered that Supplier D recently sent out a defective batch of materials. To avoid polluting our supply chain, we need to identify every manufacturer, distributor, and customer that’s downstream of Supplier D.

To do this, we’ll use the bfs_dist() algorithm. It traces the edge(s) from a starting node up, down, or in both directions of the graph to identify any other nodes in it’s path and how many edges away those nodes are from the starting node.

Breadth First Search Visual

To use the bfs_dist() algorithm, we’ll run the code below to accomplish the following:

  • Retrieve the starting node’s ID because bfs_dist() takes it as an argument.
  • Activate the node data frame because bfs_dist() requires it be activate.
  • Tidygraph algorithms are called in one of two ways: using the morph() function, or by returning the results using a dplyr verb. Create a new variable set equal to the results of bfs_dist() because it uses a dplyr verb.
  • Pass an argument for our starting node and specify the direction to trace the graph using the “mode” argument to call bfs_dist(). Passing “out” traces down the supply chain, passing “in” traces up the supply chain.
bad_supplier = 'Supp. D'bad_supplier_id <- network_grph %>%
activate(nodes) %>%
filter(name == bad_supplier) %>%
pull(id)
network_grph <- network_grph %>%
activate(nodes) %>%
mutate(bfs_results = bfs_dist(root = bad_supplier_id,
mode = 'out')
)
network_grph

When we run the code above and view our graph object, we can see that the variable we created has successfully been populated with the results of bfs_dist(). If a node was in Supplier D’s path, the number of edges that node is away from Supplier D’s node has been returned. If an “NA” was returned, it indicates that node isn’t in Supplier D’s path.

bfs_dist() Results

Using our results, we can quickly identify each manufacturer, distributor, and customer that may be affected. To communicate these findings, let’s use the function we created earlier to visualize Supplier D’s path in our supply chain.

bfs_p <- create_sankey(
grph_object = network_grph %>%
filter(!is.na(bfs_results)),
colors = colourScale
)
bfs_p
bfs_dist() Results Sankey

Scenario #2

In the second scenario, imagine that leadership is in the process of renewing contracts with distributors and is trying to quantify the importance of each one to the supply chain. We need to identify the relative importance of each distributor to help leadership competitively renegotiate.

To quantify the importance of each distributor, let’s determine the number of manufactures and customers each distributor receives goods from and sends goods to. To do this, we’ll use the to_local_neighborhood() algorithm. It traces the path from a starting node up, down, or in both directions of the graph a number of edges away we predetermine.

To Local Neighborhood Visual

To use the to_local_neighborhood() algorithm, we’ll run the code below:

  • Retrieve the starting node ID for each distributor because to_local_neighborhood() takes it as an argument.
  • Write a simple for loop to iterate through the vector of IDs so we can analyze each distributor.
  • Activate the node data frame because to_local_neighborhood() requires it be active.
  • To use to_local_neighborhood(), we must use the morph() function. It returns the results of our algorithm as a quasi-graph object that we can interact with and then merge back into our original graph object with our results using the unmorph() function.
  • Pass the algorithm name, the starting node’s ID, the number of edges out to trace, and the direction to trace to the morph() function to call to_local_neighborhood(). Passing “all” to “modewill trace the graph up and down the supply chain.
  • Create a variable called “neighbor” set equal to “TRUE” for the nodes that are returned from calling to_local_neighborhood()
  • Merge our results back into our original data frame using the unmorph() function, cast our nodes data frame to a stand alone data frame, filter out nodes that are neighbors to our starting node, and count the number of neighbors. The starting node is returned as it’s own neighbor, so filter it out too.
  • Filter out the distributor name that corresponds to its numeric ID, and print it out along with it’s number of neighbors
distributor_id <- network_grph %>%
activate(nodes) %>%
filter(str_detect(name, 'Dist.')) %>%
data.frame() %>%
pull(id)
for(i in distributor_id){

n_neighbors <- network_grph %>%
activate(nodes) %>%
morph(to_local_neighborhood,
node = i,
order = 1,
mode = 'all') %>%
mutate(neighbor = TRUE) %>%
unmorph() %>%
data.frame() %>%
filter(neighbor == TRUE,
id != i) %>%
nrow()

distributor_name <- network_grph %>%
activate(nodes) %>%
filter(id == i) %>%
pull(name)

print(paste0('Distributor: ', distributor_name,
' Neighbors: ', n_neighbors))

}

When we run the code above, we can see that we’ve successfully identified each distributor’s number of neighbors. This information is valuable for leadership, helping them understand the relative importance of each distributor as they begin renegotiations.

to_local_neighborhood() Results

In conclusion

In this article, we’ve reviewed what the fundamentals of graph theory and analysis is, as well as how you can use the tidygraph package in R to apply what you’ve learned. The next time you encounter a problem involving constructing and analyzing complex relationships efficiently, remember that you may be able to represent your data as a graph and make use of many powerful and efficient algorithms.

Slalom is a global consulting firm that helps people and organizations dream bigger, move faster, and build better tomorrows for all. Learn more and reach out today.

--

--

Joshua Limaye
Slalom Technology

I am a professional Data Scientist with a passion for all things analytics! I love to learn about this ever advancing field, and share my knowledge with others!