Converting a Tabular Dataset (CSV File) to a Graph Dataset with PyTorch Geometric

Tejpal Kumawat
12 min read · Mar 26, 2023


Graph datasets are emerging at breakneck speed these days: chemical molecules, social networks, and recommendation systems mainly store their data in the form of a graph data structure.

How to Convert a CSV File to a Graph Data Structure

Identify the basic information needed for the graph data

  • Nodes (Items, People, Locations, Cars, …)
  • Edges (Connections, Interactions, Similarity, …)
  • Node Features (Attributes)
  • Labels (Node-level, edge-level, graph-level)

and optionally:

  • Edge weights (Strength of the connection, number of interactions, …)
  • Edge features (Additional (multi-dim) properties describing the edge)
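
The checklist above can be written down as plain Python containers before any graph library is involved. The following is only a minimal sketch: the `TinyGraph` class and its field names are hypothetical and merely mirror the items in the list, they are not part of PyTorch Geometric.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TinyGraph:
    """Hypothetical container mirroring the checklist above."""
    node_features: list[list[float]]             # one row per node
    edges: list[tuple[int, int]]                 # (source, target) node indices
    labels: list[float]                          # here: one label per node
    edge_weights: Optional[list[float]] = None   # optional, one value per edge

    def num_nodes(self) -> int:
        return len(self.node_features)

    def validate(self) -> None:
        n = self.num_nodes()
        # Every edge must point at an existing row of the feature matrix
        if not all(0 <= s < n and 0 <= t < n for s, t in self.edges):
            raise ValueError("edge refers to a node index that does not exist")
        if self.edge_weights is not None and len(self.edge_weights) != len(self.edges):
            raise ValueError("need exactly one weight per edge")


graph = TinyGraph(
    node_features=[[0.1, 1.0], [0.4, 0.0], [0.9, 1.0]],
    edges=[(0, 1), (1, 2)],
    labels=[3.0, 1.0, 2.0],
    edge_weights=[0.5, 2.0],
)
graph.validate()
print(graph.num_nodes(), len(graph.edges))
```

The important constraint is the one checked in `validate`: edges refer to nodes by their row position in the feature matrix, which is why the node ordering matters in every step that follows.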

Check whether the nodes and edges are homogeneous (all of the same type) or heterogeneous (of different types).

1. Homogeneous (nodes, edges)

For our use case, we will use the FIFA 21 rating dataset, a dataset of soccer players.

Load CSV files

import pandas as pd

# Download data (quietly)
!wget -q https://raw.githubusercontent.com/batuhan-demirci/fifa21_dataset/master/data/tbl_player.csv
!wget -q https://raw.githubusercontent.com/batuhan-demirci/fifa21_dataset/master/data/tbl_player_skill.csv
!wget -q https://raw.githubusercontent.com/batuhan-demirci/fifa21_dataset/master/data/tbl_team.csv

# Load data
player_df = pd.read_csv("tbl_player.csv")
skill_df = pd.read_csv("tbl_player_skill.csv")
team_df = pd.read_csv("tbl_team.csv")

# Extract subsets
player_df = player_df[["int_player_id", "str_player_name", "str_positions", "int_overall_rating", "int_team_id"]]
skill_df = skill_df[["int_player_id", "int_long_passing", "int_ball_control", "int_dribbling"]]
team_df = team_df[["int_team_id", "str_team_name", "int_overall"]]

# Merge data
player_df = player_df.merge(skill_df, on='int_player_id')
fifa_df = player_df.merge(team_df, on='int_team_id')

# Sort dataframe
fifa_df = fifa_df.sort_values(by="int_overall_rating", ascending=False)
print("Players: ", fifa_df.shape[0])
fifa_df.head()

Output looks like

Let’s first identify the graph-specific things we need:

  • Nodes - Football players (by ID)
  • Edges - Two players are connected if they play for the same team
  • Node Features - The football player's position, specialties, ball control, ...
  • Labels - The football player's overall rating (node-level regression task)

Nodes are usually very straightforward to identify — here we even have IDs. If you don’t have a unique identifier, you need one, because you need to know between which nodes a connection exists!

The most challenging task is typically to link these nodes somehow through edges. Here we define the edges based on the team assignment. With this dataset, we could predict the expected rating when a player switches to a new team or a new player is observed. Therefore we expect relational effects through the team assignment. Of course, there are many other possibilities to define the edges such as:

  • How many times two players played together (edge weight) → Synergies
  • How many times a player has won/lost 1:1 duels (edge weight)
  • Started their career in the same football club
  • Temporal edges: “Played together in the last 2 weeks”

As you can see, there are many choices on how to combine instances in the data frame. We will continue with the easiest approach, which is connecting them according to their team assignments.

# Make sure that we have no duplicate nodes
max(fifa_df["int_player_id"].value_counts())

Each football player ID occurs only once in our dataset.

Note that we plan to build one single graph here! If individual node IDs occur more than once in your dataset, there are different options:

  • You have multiple graphs that can contain the same node. In this case, you need to iterate over each subset of your data frame that belongs to one individual graph and do the calculations on this subset
  • You have to aggregate multiple rows into one. For example, if you have transactional data (like a payment history), you would need to summarize this somehow into one feature vector, such as number of payments, payment amount, …
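
The second option can be sketched with a pandas `groupby`. The table below is made up for illustration (the column names `customer_id` and `amount` are assumptions, not part of the FIFA data); it only shows how several rows per node ID collapse into one feature row per node.

```python
import pandas as pd

# Toy transaction table (made-up values): several rows per node ID
payments = pd.DataFrame({
    "customer_id": [10, 10, 11, 12, 12, 12],
    "amount": [5.0, 15.0, 7.5, 1.0, 2.0, 3.0],
})

# Collapse to one row per node: number of payments and total amount
node_features_toy = (
    payments.groupby("customer_id")["amount"]
    .agg(num_payments="count", total_amount="sum")
    .sort_index()
)
print(node_features_toy)
```

After the aggregation every ID appears exactly once, so the uniqueness check used above would pass again.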

Extract the node features

The node features are typically represented in a matrix of the shape (num_nodes, node_feature_dim).

For each of the football players, we simply extract their attributes. Because each player id is unique, we can easily do this based on the original data frame. Have a look at the other examples in this notebook to see what an aggregation can look like if you have multiple rows for individual nodes.

# Sort to define the order of nodes
sorted_df = fifa_df.sort_values(by="int_player_id")
# Select node features (copy, so that adding columns does not touch sorted_df)
node_features = sorted_df[["str_positions", "int_long_passing", "int_ball_control", "int_dribbling"]].copy()
# Convert non-numeric columns: keep only the first listed position
positions = node_features["str_positions"].str.split(",", expand=True)
node_features["first_position"] = positions[0]
# One-hot encoding (dtype=float keeps the whole matrix numeric)
node_features = pd.concat([node_features, pd.get_dummies(node_features["first_position"], dtype=float)], axis=1, join='inner')
node_features.drop(["str_positions", "first_position"], axis=1, inplace=True)
node_features.head()

Output looks like :

That’s already our node feature matrix. The number of nodes and their ordering are implicitly defined by its rows: each row corresponds to one node in our final graph.

# Convert to numpy
x = node_features.to_numpy()
x.shape # [num_nodes x num_features]
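
To make the layout of such a matrix concrete, here is a small NumPy sketch with made-up values for three players. It performs the one-hot encoding by hand and casts everything to one numeric dtype, which is what a later conversion to a tensor needs. The variable names are illustrative only.

```python
import numpy as np

# Made-up toy data: three players with one numeric skill and a position
positions_toy = ["ST", "GK", "ST"]
ball_control = np.array([90.0, 40.0, 85.0])

# Fixed, sorted category order so the columns are reproducible
categories = sorted(set(positions_toy))              # ['GK', 'ST']
col_of = {c: i for i, c in enumerate(categories)}

one_hot = np.zeros((len(positions_toy), len(categories)), dtype=np.float32)
for row, pos in enumerate(positions_toy):
    one_hot[row, col_of[pos]] = 1.0

# Stack numeric and one-hot columns into a [num_nodes, num_features] matrix
x_toy = np.column_stack([ball_control, one_hot]).astype(np.float32)
print(x_toy.shape)
```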

Extract Labels for each node

Those are simply the ratings of each of the players. This corresponds to a node-level prediction problem, so we have as many labels as we have nodes. Of course, it can happen that we don’t have labels for all nodes. In this case, it makes sense to define masks that mark the labeled nodes; PyTorch Geometric provides helper functions for this.
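
The idea behind such a mask can be sketched in plain NumPy (this is not the PyG helper itself, and the numbers are made up): a boolean vector marks the labeled nodes, and the loss is computed only on those entries.

```python
import numpy as np

# Made-up labels; NaN marks nodes without a known rating
y_toy = np.array([91.0, np.nan, 85.0, np.nan, 78.0])
pred_toy = np.array([90.0, 70.0, 86.0, 60.0, 80.0])   # pretend model output

# Boolean mask: True where a label exists
label_mask = ~np.isnan(y_toy)

# Mean squared error restricted to the labeled nodes
masked_mse = np.mean((pred_toy[label_mask] - y_toy[label_mask]) ** 2)
print(label_mask.sum(), masked_mse)
```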

# Sort to define the order of nodes
sorted_df = fifa_df.sort_values(by="int_player_id")
# Select the labels (the player's overall rating)
labels = sorted_df[["int_overall_rating"]]
labels.head()

output looks like:

Extract Edges

That’s probably the trickiest part with a tabular dataset. You need to think of a reasonable way to connect your nodes. As mentioned previously, we will use the team assignment here.

AGAIN: There are many ways to connect the entities in a dataset and this approach is very trivial (as it will lead to disconnected subgraphs). If I wanted to build a real model from this dataset, I would probably look for a more sophisticated way to connect the players. Using a GNN is a bit overkill for the way I model the edges.

We now need to find the pairs of players that are assigned to the same team. Let’s first check how many players per team we have.

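# Remap player IDs to 0..num_nodes-1, in the same order as the node feature matrix (sorted by player ID)
fifa_df = fifa_df.sort_values(by="int_player_id").reset_index(drop=True)
fifa_df["int_player_id"] = fifa_df.index
# This tells us how many players per team we have to connect
fifa_df["str_team_name"].value_counts()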

We now need to build all pairwise combinations of the players within one team, which corresponds to a fully-connected graph within each team subgroup. We use the column int_player_id as indices for the edges. If there is, for example, a [0, 1] in the edge index, it means that the first and second nodes (with respect to the previously defined node feature matrix) are connected.

import itertools
import numpy as np

teams = fifa_df["str_team_name"].unique()
all_edges = np.array([], dtype=np.int32).reshape((0, 2))
for team in teams:
    team_df = fifa_df[fifa_df["str_team_name"] == team]
    players = team_df["int_player_id"].values
    # Build all pairwise combinations, as all players of a team are connected
    pairs = list(itertools.combinations(players, 2))
    edges_source = [e[0] for e in pairs]
    edges_target = [e[1] for e in pairs]
    team_edges = np.column_stack([edges_source, edges_target])
    all_edges = np.vstack([all_edges, team_edges])
# Convert to PyTorch Geometric format
edge_index = all_edges.transpose()
edge_index # [2, num_edges]

# Output (exact indices depend on the node ordering)
array([[    0,     0,     0, ..., 18704, 18704, 18719],
       [    7,    32,    45, ..., 18719, 18751, 18751]])


import torch

e = torch.tensor(edge_index, dtype=torch.long)
print(e)

# Transposed copy with one edge per row: [num_edges, 2]
edge_index1 = e.t().clone().detach()
edge_index1

# Output
tensor([[    0,     7],
        [    0,    32],
        [    0,    45],
        ...,
        [18704, 18719],
        [18704, 18751],
        [18719, 18751]])
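
One detail is worth spelling out: `itertools.combinations` yields every unordered pair exactly once, so the edge list stores only one direction per pair. PyTorch Geometric interprets `edge_index` entries as directed, so an undirected graph is usually represented by storing both directions (for tensors, `torch_geometric.utils.to_undirected` does this). The toy sketch below uses made-up team assignments and plain NumPy to show the effect.

```python
import itertools
import numpy as np

# Made-up team assignment: node index -> team
team_of = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B"}

# Group node indices by team
members = {}
for node, team in team_of.items():
    members.setdefault(team, []).append(node)

# One (source, target) pair per unordered pair of teammates
pairs = [p for nodes in members.values() for p in itertools.combinations(nodes, 2)]
one_way = np.array(pairs, dtype=np.int64).T            # shape [2, num_pairs]

# Mirror every edge (swap the two rows) so that both directions are present
both_ways = np.concatenate([one_way, one_way[::-1]], axis=1)
print(one_way.shape, both_ways.shape)
```

Whether the mirrored edges are needed depends on the model; for message passing in both directions between teammates, they are.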

Build the dataset

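import torch
from torch_geometric.data import Data
# PyG expects tensors; edge_index must have shape [2, num_edges], hence .t()
data = Data(x=torch.tensor(x.astype("float32")), edge_index=edge_index1.t().contiguous(), y=torch.tensor(labels.to_numpy(), dtype=torch.float))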

This data object represents one single graph.

Typically several graphs are combined in a dataset object. For this please refer to the documentation or this video. Other than that, you can also quickly build a dataloader as follows. Just create a list of all your graphs and pass them to the Pytorch Geometric dataloader.

from torch_geometric.loader import DataLoader
data_list = [Data(...), ..., Data(...)]
loader = DataLoader(data_list, batch_size=32)

Let’s visualize our graph

We will use NetworkX for this task. First, we convert the PyTorch Geometric graph into a NetworkX graph:

import networkx as nx
import torch
from torch_geometric.data import Data
from torch_geometric.utils import to_networkx

# Same graph as above; edge_index must have shape [2, num_edges]
data = Data(x=torch.tensor(x.astype("float32")), edge_index=edge_index1.t().contiguous(), y=torch.tensor(labels.to_numpy(), dtype=torch.float))
networkX_graph = to_networkx(data)
nx.draw(networkX_graph)

It looks like this

It looks a little confusing because the graph has a lot of nodes and even more edges. Visualizing a subgraph, for example the players of a single team, gives a much more readable picture.
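
Picking out such a subgraph amounts to keeping only the edges whose two endpoints are both in a chosen node set. The following NumPy sketch shows this on a made-up edge index; `induced_subgraph` is a hypothetical helper written for this illustration (PyG offers a `subgraph` utility in `torch_geometric.utils` for tensors).

```python
import numpy as np


def induced_subgraph(edge_index, keep_nodes):
    """Keep only edges whose two endpoints are both in keep_nodes."""
    keep = np.asarray(sorted(keep_nodes))
    mask = np.isin(edge_index[0], keep) & np.isin(edge_index[1], keep)
    sub_edges = edge_index[:, mask]
    # Relabel the kept nodes to 0..len(keep)-1 (keep is sorted)
    relabeled = np.searchsorted(keep, sub_edges)
    return relabeled, keep


# Toy edge index in [2, num_edges] layout (made-up values)
toy_edges = np.array([[0, 0, 1, 3],
                      [1, 2, 2, 4]])
sub, kept = induced_subgraph(toy_edges, {0, 2, 3, 4})
print(sub)
```

The relabeled edge list is small enough to draw, and `kept` records which original node each new index stands for.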

With that, our graph with homogeneous nodes is ready.

2. Heterogeneous (nodes or edges)

Recommender systems are a classical example of this and therefore I chose the Anime Recommender Database (a movie recommendation dataset).

Loading dataset

import pandas as pd

# Download data (quietly)
!wget -q https://raw.githubusercontent.com/Mayank-Bhatia/Anime-Recommender/master/data/anime.csv
!wget -q https://raw.githubusercontent.com/Mayank-Bhatia/Anime-Recommender/master/data/rating.csv

anime = pd.read_csv("anime.csv")
rating = pd.read_csv("rating.csv")

As before, let’s first identify the graph entities we need.

  • Nodes - Users and Animes (two node types with different features = heterogeneous)
  • Edges - If a user has rated a movie / the rating (edge weight)
  • Node Features - The movie attributes and for the users, we have no explicit features so we have to figure something out later
  • Labels - The rating for a movie (link prediction regression task)

This dataset will, just like Example 1, lead to one single graph that contains all nodes and edges. Given a pair of a user node and an anime movie node, we will be able to predict whether (and how much) the user likes this movie. To model this as a graph, we will have to support two node types: movie and user. That's because they have different node features (and shapes) that would not fit into one joint node feature matrix.

Extract the node features

Each of the movies occurs only once in the anime data frame and hence we can directly extract the features from there. If you have multiple entries for each node (movie ID) in your data frame, you need to aggregate them into a single row first.

We will just extract the columns with specific attributes and convert them to numeric features…

For the anime movies …

  • First, we need to do a re-mapping of the IDs. That’s because they don’t start with 0 and also not all IDs are present. That’s however important because the edge_index is always referring to the index in the node feature matrix
  • We will store this mapping because we will need it later

# Sort to define the order of nodes
sorted_df = anime.sort_values(by="anime_id").set_index("anime_id")

# Map IDs to start from 0
sorted_df = sorted_df.reset_index(drop=False)
movie_id_mapping = sorted_df["anime_id"]

# Select node features (copy, so that adding columns does not touch sorted_df)
node_features = sorted_df[["type", "genre", "episodes"]].copy()

# Convert non-numeric columns
# Episode counts may contain non-numeric placeholders; those become NaN
node_features["episodes"] = pd.to_numeric(node_features["episodes"], errors="coerce")

# For simplicity I'll just select the first genre here and ignore the others
genres = node_features["genre"].str.split(",", expand=True)
node_features["main_genre"] = genres[0]

# One-hot encoding
anime_node_features = pd.concat([node_features, pd.get_dummies(node_features["main_genre"], dtype=float)], axis=1, join='inner')
anime_node_features = pd.concat([anime_node_features, pd.get_dummies(anime_node_features["type"], dtype=float)], axis=1, join='inner')
# Drop the original string columns, including "type", so only numeric columns remain
anime_node_features.drop(["type", "genre", "main_genre"], axis=1, inplace=True)
anime_node_features.head(10)

Output Looks like

# Convert to numpy
x = anime_node_features.to_numpy()
x.shape # [num_movie_nodes x movie_node_feature_dim]

For the users …

Here we are missing a data frame that describes the attributes of each user. As a workaround we have different options:

  • Either we insert dummies (for example random values between 0 and 1 like [0, 0.5, 0.1, 1]), which will then be updated through message passing
  • Or we calculate some stats about the users, such as average rating, number of ratings, … (based on the rating data frame)
  • Or we use typical characteristics of each node as features (degree, neighborhood, or even Node2Vec embedding)

We will go with the second option here.

# Find out mean rating and number of ratings per user
mean_rating = rating.groupby("user_id")["rating"].mean().rename("mean")
num_rating = rating.groupby("user_id")["rating"].count().rename("count")
user_node_features = pd.concat([mean_rating, num_rating], axis=1)

# Remap user ID (to start at 0)
user_node_features = user_node_features.reset_index(drop=False)
user_id_mapping = user_node_features["user_id"]

# Only keep features
user_node_features = user_node_features[["mean", "count"]]
user_node_features.head()
# Convert to numpy
x = user_node_features.to_numpy()
x.shape # [num_user_nodes x user_node_feature_dim]

Those are already our node feature matrices. We could of course also normalize the values to be in the range of (0,1).
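
A column-wise min-max scaling is one simple way to do that normalization. The sketch below uses made-up user features in the same two-column layout (mean rating, number of ratings); `min_max_scale` is a helper written for this illustration, not a library function.

```python
import numpy as np


def min_max_scale(features):
    """Scale each column to [0, 1]; constant columns become all zeros."""
    features = np.asarray(features, dtype=np.float64)
    col_min = features.min(axis=0)
    col_range = features.max(axis=0) - col_min
    col_range[col_range == 0] = 1.0      # avoid division by zero
    return (features - col_min) / col_range


# Made-up [mean rating, rating count] rows for three users
toy_user_features = np.array([[7.5, 10], [9.0, 250], [6.0, 130]])
scaled = min_max_scale(toy_user_features)
print(scaled)
```

Scaling matters here because the rating count can be orders of magnitude larger than the mean rating, which would otherwise dominate the feature vector.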

For the movies, we have clear attributes that describe each node. For the users, we have calculated some basic properties that provide information about the rating behavior.

The number of nodes and their ordering are implicitly defined by the rows of each matrix. Each row corresponds to one node in our final graph.

Extract the labels

In this example, we have a link prediction/regression problem and thus the labels are the edges. The plot below shows the distribution of the ratings. Later the task will be to predict the ratings between a user and a movie.

Unlike in Example 1, the number of labels now equals the number of edges.

rating.head()

# Output
   user_id  anime_id  rating
0        1        20      -1
1        1        24      -1
2        1        79      -1
3        1       226      -1
4        1       241      -1

# -1 means the user watched the anime but didn't assign a rating
rating["rating"].hist()

As you can see (below), we don’t have all of the movies in the rating table (which is natural, because we usually don’t have ratings for all items). This means we don’t have labels for all user-item pairs, but only a subset.

To consider this in the loss calculation, we can simply store a mask of the indices that are available. Previously I also quickly talked about masks, in the node-level prediction case. It is exactly the same here — we just want to perform predictions for the entities for which we have a label.

As we have an edge-prediction problem here, we implicitly stored this mask already as edge_index. For each edge, we know the label and therefore we only have to calculate the loss based on the edges we know. Later at inference time, we can also predict the edge attributes (labels) for other node pairs.

# Movies that are part of our rating matrix
rating["anime_id"].unique()

# Output
array([ 20, 24, 79, ..., 29481, 34412, 30738])

# All movie IDs (e.g. no rating above for 1, 5, 6...)
anime["anime_id"].sort_values().unique()

# Output
array([ 1, 5, 6, ..., 34522, 34525, 34527])

# We can also see that there are some movies in the rating matrix, for which we have no features (we will drop them here)
print(set(rating["anime_id"].unique()) - set(anime["anime_id"].unique()))
rating = rating[~rating["anime_id"].isin([30913, 30924, 20261])]

# Output
{30913, 30924, 20261}

Extract labels

# Extract labels
labels = rating["rating"]
labels.tail()
# Convert to numpy
y = labels.to_numpy()
y.shape
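
Because -1 is a placeholder rather than a real score, one may want to exclude those entries from the regression targets. Whether to do so is a modelling decision (the walkthrough here keeps all rows). If the rows are filtered, the same mask has to be applied to the edges and to the labels so that they stay aligned. A minimal sketch with made-up values:

```python
import numpy as np

# Made-up edges in [2, num_edges] layout and their ratings (-1 = no score given)
toy_edge_index = np.array([[0, 0, 1, 2, 2],      # user indices
                           [3, 5, 3, 4, 5]])     # movie indices
toy_ratings = np.array([-1, 8, 10, -1, 6])

# Keep only edges that carry an explicit score
rated = toy_ratings != -1
edge_index_rated = toy_edge_index[:, rated]
y_rated = toy_ratings[rated].astype(np.float32)
print(edge_index_rated, y_rated)
```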

Extract the edges

In this example, the edges are already implicitly provided by the rating matrix. The important part however is that we need to use the remappings from before to align the IDs of the data frames.

For each entry in the matrix, we have exactly one edge, between the user_id and the anime_id. Therefore, the edge index is exactly the same as what we have calculated in the cell above.

The edge index can later be used in the model to mask out all edges for which we have no information.

# Map anime IDs 
movie_map = movie_id_mapping.reset_index().set_index("anime_id").to_dict()
rating["anime_id"] = rating["anime_id"].map(movie_map["index"]).astype(int)
# Map user IDs
user_map = user_id_mapping.reset_index().set_index("user_id").to_dict()
rating["user_id"] = rating["user_id"].map(user_map["index"]).astype(int)

print("After remapping...")
rating.head()
edge_index = rating[["user_id", "anime_id"]].values.transpose()
edge_index # [2 x num_edges]
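
The remapping logic is easier to follow on a tiny example. The sketch below uses made-up raw IDs with gaps and plain dictionaries instead of pandas: the position of an ID in the sorted list becomes its node index, and users and movies are numbered independently, each starting at 0.

```python
# Made-up raw IDs with gaps, similar in spirit to the anime and user IDs
raw_movie_ids = [5, 1, 20, 6]
raw_ratings = [(101, 20, 8), (101, 1, 7), (205, 6, 9)]   # (user, movie, score)

# Position in the sorted ID list becomes the node index
movie_index = {raw: i for i, raw in enumerate(sorted(raw_movie_ids))}
user_index = {raw: i for i, raw in enumerate(sorted({u for u, _, _ in raw_ratings}))}

# Edge index in [2, num_edges] layout: row 0 = users, row 1 = movies
sources = [user_index[u] for u, _, _ in raw_ratings]
targets = [movie_index[m] for _, m, _ in raw_ratings]
toy_edge_index = [sources, targets]
print(toy_edge_index)
```

Row `i` of the movie feature matrix must describe the movie with index `i` in this mapping, which is why the mapping is stored when the features are built.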

Build the dataset

Now we have all the components we need to build a graph for libraries like Pytorch Geometric or DGL. I won’t install these libraries here, as this will make the notebook too bulky, but here are some code snippets for the final steps.

For Heterogeneous Graphs, we need to store the individual matrices in HeteroData objects, which can hold multiple node/edge matrices. There is also a great tutorial for heterogenous graphs in PyG

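import torch
from torch_geometric.data import HeteroData
data = HeteroData()
# Node features must be float tensors (this assumes every feature column is numeric)
data['user'].x = torch.tensor(user_node_features.to_numpy(dtype="float32"))
data['movie'].x = torch.tensor(anime_node_features.to_numpy(dtype="float32"))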

If you have different edge types between the nodes, you can also consider this here. In the example above we only have one type, therefore the edge_index looks like this (a triplet):

data['user', 'rating', 'movie'].edge_index = torch.tensor(edge_index, dtype=torch.long)

Finally, we can add the labels of the link-prediction setup like this. In Heterogenous graphs, you can also have different labels between different entities, but here we just have one type. If you build recommender systems specifically, then you might also find this tutorial on bipartite graphs helpful.

data['user', 'rating', 'movie'].y = torch.tensor(y, dtype=torch.float)
