Solving Network Puzzles: How Graph Theory Simplifies Analysis

Emine Yavuz
Published in KoçDigital
7 min read · Oct 25, 2023

Introduction

In an era where extracting insights from data is crucial but understanding relationships within datasets is challenging, the use of different tools becomes essential. While we can always try merging datasets, this method may fall short when we deal with vast volumes of data that require quick analysis. In this post, we illustrate the use of graph theory, which proves quite convenient in such cases.

Take, for instance, a bank’s need to detect fraudulent activity. Merging data from all of its customers and monitoring transactions in real time costs the bank considerable time and effort. This is where graph theory steps in: it transforms complex tabular data into a clear, understandable graph, allowing us to track suspicious activity in banking operations as it happens.

Similarly, for online shopping platforms, the ability to deliver real-time advertisements is important, but it is not easy to analyze a user’s shopping history and current preferences on the spot. In such scenarios, graph theory becomes an invaluable tool.

Today, we will dive into the foundational aspects of graph theory and construct a graph using a sample tabular dataset.

1. Graph Structure

We will use the Covid-19 Vaccine Procurement Dataset, which was collected by the Duke Global Health Innovation Center.

An abstract graph is defined by two main components: nodes and edges. Nodes are the vertices of the graph and represent the entities that form it; edges represent the relationships between nodes. For example, in the graph for this dataset we can see that Russia has a relation with Germany, but not with Belgium.

So, to create a graph from a tabular dataset, we need to identify the nodes and specify the relationships between them. Nodes, which represent the entities, can be identified in three steps:

1. Element: Elements are the smallest parts of our graph; basically, each column of the table is an element.

Example from this dataset: Company and Vaccine Name, License Holder, Company’s Country, Purchaser Entity / Country, Number of Doses Procured, COVID Burden (cases/million), Doses intended to be purchased, Limited Regulatory Approval.

2. Compound: Compounds are groups of elements.

Example from this dataset: A compound can be created by merging Company Name and Company’s Country.

3. Entity: Entities are identifiable groups of one or more compounds. They are distinguished by their attributes. Entities correspond to one of the key parts of the graph: the nodes.

Example from this dataset: The vaccine company’s country and the purchaser country are the entities of this graph, and they are related through procurement activity.

Next, we define the edges of our graph; in other words, we establish the relationships between entities.

Example from this dataset: Each procurement record indicates a relation between two nodes. A country can appear in both roles: the company behind the Oxford-AstraZeneca_AZD1222 vaccine is based in the United Kingdom, and the United Kingdom is also a purchaser country for vaccines.
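To make the node and edge definitions concrete, here is a minimal sketch in plain Python. The rows are made-up placeholders rather than values from the dataset, and the names rows, nodes, and edges are chosen only for illustration.

```python
# Illustrative (producer_country, purchaser_country) pairs.
# These are made-up placeholders, not rows from the actual dataset.
rows = [
    ("A", "B"),
    ("A", "C"),
    ("B", "B"),  # a country may buy from a company based in the same country
]

# Nodes: every distinct country that appears in either role.
nodes = sorted({country for pair in rows for country in pair})

# Edges: each distinct procurement relationship between two nodes.
edges = sorted(set(rows))

print(nodes)  # ['A', 'B', 'C']
print(edges)  # [('A', 'B'), ('A', 'C'), ('B', 'B')]
```

Each row of the table contributes at most one edge, while a country contributes a single node no matter how many rows mention it.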

2. Exploratory Data Analysis

The second step in creating a graph structure from a tabular dataset is data preprocessing. Here we address some issues that we found in the data.

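import pandas as pd

# `data` is assumed to be the raw procurement table, already loaded as a pandas DataFrame
def data_preprocess_1(data):

    data.columns = [x.replace(" ", "_") for x in data.columns.values]
    data_short = data[["Company's_Country", 'Purchaser_Entity_/_Country']].drop_duplicates()
    data_short = data_short.apply(lambda x: x.str.rstrip())
    # Drop nulls, then drop the duplicates that only show up once trailing spaces are removed
    data_short = data_short.dropna().drop_duplicates()

    return data_short

df_entity = data_preprocess_1(data)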

We notice that the column names contain spaces, so in the first line of the function we replace them with ‘_’.

In the second line of the function, we select our nodes. Since the relationship we want to capture is the relationship between countries, we filter two columns: the vaccine-producing country and the purchasing country.

Because of stray trailing spaces in the data, some values appear as duplicates (e.g., “UK” vs. “UK ”). In the third line, we use the rstrip method to remove trailing spaces from every value. In the last line before returning the data frame, we drop the null values.
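The effect of trailing spaces on duplicates can be seen in a small sketch. The Series below holds toy values chosen for illustration, not rows from the dataset.

```python
import pandas as pd

# Toy values: "UK" appears once with a trailing space.
countries = pd.Series(["UK", "UK ", "USA"])

# Before cleaning, "UK" and "UK " count as two different values.
unique_before = countries.nunique()

# rstrip removes trailing whitespace, so both spellings collapse to "UK".
cleaned = countries.str.rstrip()
unique_after = cleaned.nunique()

print(unique_before, unique_after)  # 3 2
```

For this reason, duplicates are best dropped after stripping; deduplicating only beforehand would leave both spellings of the same country in the data.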

Another issue is that some vaccines are produced by companies from two different countries, so to find the relationship between them, we will need to separate the country names and duplicate the data for each sale.

def vaccine_country_division(dataframe, first_country, second_country):

    # Rows whose producer is recorded with a joint "first/second" label
    is_joint = dataframe["Company's_Country"] == first_country + '/' + second_country

    data_main = dataframe[~is_joint]

    # Work on copies so the assignments below do not touch slices of the input
    df_first = dataframe[is_joint].copy()
    df_second = dataframe[is_joint].copy()

    df_first["Company's_Country"] = first_country
    df_second["Company's_Country"] = second_country

    df_final_entity = pd.concat([data_main, df_first, df_second], ignore_index=True)

    return df_final_entity

For example, the Pfizer-BioNTech_BNT162 vaccine is produced by the American company Pfizer and the German biotechnology company BioNTech. So, we need to attribute each sale of the Pfizer-BioNTech_BNT162 vaccine to both the USA and Germany. With this function, we split the rows labeled USA/Germany into USA and Germany, duplicating each sale record so that it appears once for each country. We apply the same process to the Sanofi-GSK_SARS-CoV-2 vaccine of France and the UK.
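As an alternative sketch (not the approach used in this post), pandas can split every slash-separated label in one pass with str.split and explode. The helper name split_joint_countries and the toy data frame are assumptions made for illustration, and the sketch assumes that a ‘/’ in the country column always separates producer countries.

```python
import pandas as pd

def split_joint_countries(df, column="Company's_Country", sep="/"):
    """Duplicate each row once per country listed in `column` (hypothetical helper)."""
    out = df.copy()
    # "USA/Germany" -> ["USA", "Germany"]; single countries become one-item lists
    out[column] = out[column].str.split(sep)
    # explode emits one row per list item and repeats the other columns
    return out.explode(column).reset_index(drop=True)

# Toy rows for illustration only.
toy = pd.DataFrame({
    "Company's_Country": ["USA/Germany", "UK"],
    "Purchaser_Entity_/_Country": ["Canada", "UK"],
})

result = split_joint_countries(toy)
print(result)
```

Unlike vaccine_country_division, this variant needs no list of country pairs, but it would also split any other label that happens to contain a slash, so it is worth checking the distinct values of the column first.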

def data_preprocess_2(data):

    df_divided_1 = vaccine_country_division(data, 'USA', 'Germany')
    df_divided_2 = vaccine_country_division(df_divided_1, 'France', 'UK')

    df_entity = df_divided_2.copy()

    for col in df_entity.select_dtypes(['object']).columns:
        df_entity[col] = df_entity[col].astype(str)

    return df_entity

After executing the vaccine_country_division function for both country pairs, data_preprocess_2 converts the columns of data type object into strings and returns the new dataset.

After all preprocessing steps, the data is transformed from the format in the first image to the format in the second image.

3. Adjacency Matrix

An adjacency matrix helps us understand the relations between nodes. Pairs of data points that are related (that is, connected by an edge) are represented with 1, and all other pairs with 0. If we have multiple datasets, we first gather all entities (all data points in the datasets) together. Even when there is no relation between two entities, we record this ‘non-relation’ as ‘0’. So, the matrix needs an entry for every pair of nodes in order to show all relationships.
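A small worked example may help here. The sketch below builds the matrix for three placeholder nodes in plain Python; the node names and relations are made up for illustration, and the relation is treated as undirected.

```python
# Toy node list and relationships (placeholders, not dataset values).
nodes = ["A", "B", "C"]
edges = [("A", "B"), ("B", "C")]

# Start from an all-zero matrix that covers every pair of nodes.
index = {name: i for i, name in enumerate(nodes)}
matrix = [[0] * len(nodes) for _ in nodes]

# Mark each related pair with 1 in both directions (undirected relation).
for source, target in edges:
    matrix[index[source]][index[target]] = 1
    matrix[index[target]][index[source]] = 1

for name, row in zip(nodes, matrix):
    print(name, row)
# A [0, 1, 0]
# B [1, 0, 1]
# C [0, 1, 0]
```

A and C are not related, so their cells stay 0, which is the ‘non-relation’ described above.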

NetworkX, a Python package, can be used to create networks from a dataset and then produce the adjacency matrix. But what if the dataset is too big to create a graph immediately?

One way to solve this problem is to create the adjacency matrix first. First, this can decrease the time needed to build graphs. Second, it allows memory to be used more effectively, since out-of-memory errors are a common problem in network studies. Finally, it makes the relations easy to inspect at a glance: the relationships between the nodes can be analyzed without creating the graph.

The first step in creating the adjacency matrix is gathering all nodes together. As mentioned above, one country can be both a buyer and a seller of a vaccine. Let’s dive into the Python code:

def create_adjacency_matrix(df, column_set):

    # First step
    nodes = pd.concat([df[col] for col in column_set]).unique()

    # Second step
    adj_matrix = pd.DataFrame(0, index=nodes, columns=nodes)
    print(adj_matrix.head())

    # Third step
    for _, row in df.iterrows():
        for i in range(len(column_set)):
            for j in range(len(column_set)):
                if i != j:
                    adj_matrix.at[row[column_set[i]], row[column_set[j]]] = 1

    # Fourth step
    #for node in nodes:
        #adj_matrix.at[node, node] = 1

    return adj_matrix

First step: We gather the values from all of the selected columns and keep only the unique ones in a single array: nodes.

Second step: We create an empty adjacency matrix with 0 values. This matrix will be filled in the third step.

Third step: In this step, we fill the adjacency matrix with the value ‘1’ wherever there is a procurement process between the entity in the row and the entity in the column. The function takes as inputs a data frame and the columns that will form the nodes of our network. Here, we pass the final version of our data frame and the columns “Company’s_Country” and “Purchaser_Entity_/_Country”.

columns_to_connect = [ "Company's_Country","Purchaser_Entity_/_Country"]
adj_matrix = create_adjacency_matrix(data_preprocess_2(df_entity), columns_to_connect)

We replace the value 0 with 1 in the adjacency matrix wherever the row-column pair matches a pair of countries in the data frame. Such a pair in the actual data indicates a purchase transaction between the two countries, so we place a 1 in both of their positions in the matrix.

Fourth step: Since every node is related to itself, the cells whose row and column carry the same node name can be filled with ‘1’. However, if you only want to see a node’s relationships with others, you can comment out this part, as is done in the listing above.
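The row-by-row loop is easy to follow, but iterating over a data frame with iterrows can be slow on large tables. One possible alternative is sketched below for the case of exactly two node columns, using pd.crosstab. The function name adjacency_from_crosstab and the toy data are assumptions for illustration, not part of the original code.

```python
import pandas as pd

def adjacency_from_crosstab(df, source_col, target_col):
    """Build a symmetric 0/1 adjacency matrix without a Python-level row loop."""
    nodes = pd.concat([df[source_col], df[target_col]]).unique()

    # Count (source, target) pairs, then square up the table so that
    # every node appears as both a row and a column.
    counts = pd.crosstab(df[source_col], df[target_col])
    counts = counts.reindex(index=nodes, columns=nodes, fill_value=0)

    # Any positive count becomes 1; adding the transpose mirrors each relation.
    directed = (counts > 0).astype(int)
    return ((directed + directed.T) > 0).astype(int)

# Toy rows for illustration only.
toy = pd.DataFrame({
    "Company's_Country": ["A", "A", "B"],
    "Purchaser_Entity_/_Country": ["B", "C", "B"],
})

adj = adjacency_from_crosstab(toy, "Company's_Country", "Purchaser_Entity_/_Country")
print(adj)
```

As in the loop version, a row in which the same country is both producer and purchaser puts a 1 on the diagonal for that country.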

4. Result

Graph theory helps us see the trade patterns of vaccines more clearly. The United Kingdom and the United States are the two countries involved in the most vaccine transactions among all countries.
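One way such a ranking could be read off an adjacency matrix is by summing each row, which counts how many partners a node has. The sketch below uses a toy matrix with placeholder labels; it does not reproduce the actual results, and the post does not state that the ranking was computed this way.

```python
import pandas as pd

# Toy symmetric adjacency matrix (placeholder labels, not real results).
labels = ["A", "B", "C"]
adj = pd.DataFrame(
    [[0, 1, 1],
     [1, 0, 0],
     [1, 0, 0]],
    index=labels,
    columns=labels,
)

# Row sums count how many distinct partners each node has.
degree = adj.sum(axis=1).sort_values(ascending=False)

# The most connected nodes come first.
print(degree.head(2))
```

Note that row sums count distinct partners, not the number of individual deals, because the matrix only stores 0 or 1 for each pair.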

Conclusion

Throughout this post, we reviewed the basics of graph theory and explored how to create a graph from tabular data. In the end, we constructed a structure that lets us extract real-time insights from what may initially seem like complex data, by transforming this tabular data into a clear, understandable graph.

Emine Yavuz, Data Scientist

Şeymanur Ergezgin, Data Scientist

References

Ahmed, I. (2021). Kaggle. Retrieved from Kaggle Web site: https://www.kaggle.com/datasets/ibtesama/covid19-vaccine-procurement-dataset

West, D. B. (2001). Introduction to Graph Theory. Prentice Hall.
