An Introduction to Graph Machine Learning with TensorFlow and TigerGraph

Michael (Dezhou Chen) Chen

10 min read · Jul 20, 2021


This project was co-created with Daniel (LinkedIn)

If you would like to follow along with this blog, open the Google Colab notebook: [TigerGraph with Tensorflow.ipynb — Colaboratory (google.com)]

Overview and Objective:

In this blog, we will walk through a demonstration that uses TensorFlow with the output of a graph to predict whether a user will like a movie. The data is collected from a TigerGraph database using pyTigerGraph (a Python package) via a REST call on a user-built query. The output of the call arrives in the notebook in JSON format. Using pandas, we can transform that data into a DataFrame. After transforming the data, we get it ready for a basic ML model.

This blog will walk you through how to create a MyGraph solution with TigerGraph, how to extract data using GSQL, and how to build an ML model using the TensorFlow framework. Let’s get started!

Sections Discussed:

Part I: Create a (Free) TigerGraph Solution at https://tgcloud.io/

Part II: Setup Your Notebook

Part III: Data Preprocessing

Part IV: Set Up TensorFlow

Part V: Training and Testing Model

Part VI: Model Optimization with Gradient Descent

Part VII: Model Optimization with Heat Map Feature Selection

Part I: Create a (Free) TigerGraph Solution at https://tgcloud.io/

Firstly, you’ll need to create a free solution at https://tgcloud.io/. It’s a cloud platform where we’ll host our graph database. Go to https://tgcloud.io/ and create an account if you haven’t already. You can sign up with Google, LinkedIn, or with an email.

Once you have signed up or logged in, navigate to the “My Solutions” tab in the left sidebar then press the blue “Create Solution” button on the top right.

To get started, you will need a TigerGraph cloud instance with the Recommendation Engine (Movie Recommendation) v3.1.1 Starter Kit. Don’t select the blank one, because you won’t have data or a model in there.

Once you click “Next”, you will see the instance settings as below. Simply keep the default settings.

As for the Solution Settings, you can change the solution name as well as the initial password. In general, we recommend setting the password to “tigergraph”.

Simply press “Submit” before moving on.

Once you have the box provisioned with the Starter Kit mentioned above, open GraphStudio.

On the top left you will see Global View, and you should see something like this before moving on:

Click on that and choose MyGraph. Once you select MyGraph, Global View will go away and it will look similar to this.

Navigate to the tab called Load Data. You will see a schema with person-(rate)-movie, along with some data files mapped to the vertices and edges. Click the “play button” to start the load process.

Once you finish loading the data, you will have:

Perfect. Now your graph is UP and your data is LOADED. Let’s move on.

Part II: Setup Your Notebook

Step I: We will need the following packages installed in your Google Colab environment. pyTigerGraph is the connector we will be using to interface with TigerGraph.

!pip install pyTigerGraph
!pip install pandas
!pip install flat-table
!pip install tensorflow
!pip install -q sklearn

Next, we need to import the packages that we installed previously.

# import the packages installed above, using the aliases referenced throughout this notebook
import pyTigerGraph as tg
import pandas as pd
import flat_table
import tensorflow as tf

Step II: Setup Server Connection to Notebook

Now, let’s set up the box with the appropriate settings. When you provisioned your box, you gave it a unique URL. Insert that URL into the host parameter. The other parameter you will need to modify is the password; in general, the default password is tigergraph. Replace it with the password you entered during the provisioning process. Run the cell. We will print the token to verify that the connection is working.

conn = tg.TigerGraphConnection(
    host="https://bc7df6cba5694c1c99b76130279ecc3e.i.tgcloud.io/",
    graphname="MyGraph",
    gsqlVersion="3.0.5",
    username="tigergraph",
    password="tigergraph",
    useCert=True
)
secret = conn.createSecret()
token = conn.getToken(secret, setToken=True)
print(token)

Let’s check out the token!

('lksb5a6bcr7rf01ofc3aj3n8q81bbtu8', 1629258879, '2021-08-18 03:54:39')

You are connected! The tuple contains the token itself along with its expiration time, as an epoch timestamp and a human-readable date. Let’s use the connection to see what endpoints exist on the box.

results = conn.getEndpoints()
print(results)

You will see something like this:

{'DELETE /graph/{graph_name}/delete_by_type/vertices/{vertex_type}/': {'parameters': {'ack': {'default': 'all', 'max_count': 1...

We will need to create a new query and install it on the box. We will use a graph called MyGraph. The query will accept a user parameter, which is a person vertex, and get all the ratings that user gave to movies.

conn.gsql('''
USE GRAPH MyGraph
CREATE QUERY userData(VERTEX<person> p) FOR GRAPH MyGraph {
  // Feature Extraction of person: movieID, movieTitle, userRating, term, termRating
  // Sample Param = 271
  SumAccum<float> @rating;
  src = {p}; // From the User
  S1 = SELECT tgt FROM src:s -(rate:e)-> movie:tgt // Grab all the movies that they rated
       ACCUM tgt.@rating += e.rating; // Also add a local variable of that user's rating
  PRINT S1[S1.title as movieTitle, S1.@rating as userRating, S1.genres as genre];
}
INSTALL QUERY userData
''', options=[])
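To confirm the installation succeeded, one option is to list the endpoints again and look for the new query. This is a minimal sketch reusing the getEndpoints call from earlier; the endpoint name shown is what TigerGraph typically generates for an installed query:

# An installed query is exposed as its own REST endpoint on the box,
# e.g. 'GET /query/MyGraph/userData'
endpoints = conn.getEndpoints()
print([name for name in endpoints if 'userData' in name])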

Part III: Data Preprocessing

Now the query is installed and the notebook is connected. Let’s call the endpoint that we created to fetch the data about person 118205.

preInstalledResult = conn.runInstalledQuery("userData", {"p": "118205"})
parsR = preInstalledResult
print(parsR) # full return of REST call

Then we have information regarding person 118205:

[{'S1': [{'v_id': '3989', 'v_type': 'movie', 'attributes': {'movieTitle': 'One Day in September (1999)', 'userRating': 4, 'genre': 'Documentary'}}…
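Before flattening this with pandas, it helps to see how the structure nests. As a minimal sketch: the returned list has one entry per PRINT statement in the query, and “S1” maps to the list of selected vertices:

# Index into the nested result by hand: parsR[0] is the output of the
# query's single PRINT statement, and "S1" holds the movie vertices
first_movie = parsR[0]["S1"][0]
print(first_movie["v_id"], first_movie["attributes"]["movieTitle"])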

Next, we need to convert the JSON data structure into a DataFrame.

df = pd.DataFrame(parsR[0]["S1"]) # Grab only the data we are returning
df # take a look at the data format

Then we have:

We need to normalize the data in the attributes column.

df_t1 = flat_table.normalize(df)
df_t1['attributes.userRating'] = df_t1['attributes.userRating']/5 # dividing by 5 to get a decimal rating which will be used in the model
df_t1 # Output DataFrame

Then we have:

Let’s rename the data columns:

df_t2 = df_t1.rename(columns={'v_id': 'ID',
                              'v_type': 'Type',
                              'attributes.movieID': 'movieID',
                              'attributes.movieTitle': 'movieTitle',
                              'attributes.userRating': 'userRating',
                              'attributes.genre': 'genre'})
df_t2 # Output DataFrame

Then we will have:

Let’s break out the genres into separate rows.

data = {'ID': [], 'Type': [], 'movieTitle': [], 'genre': [], 'userRating': []}
for i in df_t2.index:
    genres = df_t2["genre"][i].split("|")
    for e in genres:
        data["ID"].append(df_t2["ID"][i])
        data["Type"].append(df_t2["Type"][i])
        data["movieTitle"].append(df_t2["movieTitle"][i])
        data["genre"].append(e)
        data["userRating"].append(df_t2["userRating"][i])
df_t2 = pd.DataFrame(data, columns=['ID', 'Type', 'movieTitle', 'genre', 'userRating'])
print(df_t2)
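As an aside, if your pandas version is 0.25 or newer, the same breakout can be done without an explicit loop by splitting the pipe-delimited string and calling explode. A minimal sketch of that alternative (run it instead of the loop above, not after it):

# Equivalent breakout using DataFrame.explode (pandas >= 0.25):
# turn the pipe-delimited genre string into a list, then give each
# list element its own row
df_alt = df_t2.assign(genre=df_t2['genre'].str.split('|')).explode('genre')
df_alt = df_alt.reset_index(drop=True)
print(df_alt)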

Then we have:

Next, let’s pivot on the genres and turn them into columns. Each genre column will hold the user’s rating when the movie belongs to that genre, and NaN otherwise, which we then fill with 0.

# Pivoting the genres into column headers
df_t3 = df_t2.pivot(index='ID', columns='genre', values='userRating')
df_t3 = pd.DataFrame(df_t3, columns=['Action', 'Adventure', 'Animation', 'Children', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'IMAX', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western'])
df_t3 = df_t3.fillna(0)
df_t3 # Output DataFrame

Then we have:

Let’s put the data frames together to see what we end up with.

# Put dataframes together
df_t4 = pd.merge(df_t2, df_t3, how='outer', on=['ID'])
df_t5 = df_t4.drop(columns=['genre'])
df_t6 = df_t5.drop_duplicates(subset="ID")
df_t6 # Output DataFrame

Then we have:

Part IV: Set Up TensorFlow

Since the DataFrames are ready to be used, let’s pull out the features, which are the genres.

# Features to grab
features = ['Action', 'Adventure', 'Animation', 'Children', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'IMAX', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']

Next, let’s set up the tensor slices using the features and user ratings.

# Setting up your data for TensorFlow
dataset = tf.data.Dataset.from_tensor_slices((
    tf.cast(df_t6[features].values, tf.float64),
    tf.cast(df_t6['userRating'].values, tf.float64)
))
tf.keras.backend.set_floatx('float64')
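A quick sanity check on what the dataset yields (element_spec is a standard attribute of tf.data datasets):

# Each element should be a (features, label) pair of float64 tensors
print(dataset.element_spec)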

Let’s take a look at the data.

# Checking Data
for feat, targ in dataset.take(10):
    print('Features: {}, Target: {}'.format(feat, targ))

Then we have:

Let’s look at userRating.

# Checking Data
tf.constant(df_t6['userRating'])

Then we have:

<tf.Tensor: shape=(9254,), dtype=float64, numpy=array([0.8, 0.6, 0.6, ..., 0.6, 0.6, 0.6])>

Part V: Training and Testing Model

Now, let’s create a training dataset and a testing dataset to use with our model.

# Prep datasets, one for model creation, other for model validation
dataset = dataset.shuffle(len(df_t6)).batch(1)
train_dataset = dataset.take(int(len(df_t6)*.8))
test_dataset = dataset.skip(int(len(df_t6)*.8))
print(len(list(train_dataset)))

Then we have:

7403
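One caveat worth knowing about the split above: shuffle reshuffles the data on every pass by default, so calling take and skip on the same shuffled dataset can let examples leak between the two splits across epochs. If you rebuild the split starting from the unbatched dataset, a common safeguard is to freeze the shuffle order:

# A drop-in replacement for the split cell above: disable reshuffling
# so take() and skip() always see the same ordering and the
# train/test splits stay disjoint
dataset = dataset.shuffle(len(df_t6), reshuffle_each_iteration=False).batch(1)
train_dataset = dataset.take(int(len(df_t6)*.8))
test_dataset = dataset.skip(int(len(df_t6)*.8))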

Next, let’s set up our model. Feel free to tweak any of these settings. We will be trying to reduce the prediction error of this model. Note that the sigmoid output layer pairs naturally with the ratings we scaled into the 0-1 range earlier.

def get_compiled_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(8, activation='relu'), # layer 1
        # tf.keras.layers.Dense(20, activation='relu'), # layer 2
        tf.keras.layers.Dense(1, activation='sigmoid') # out
    ])
    # training operations for optimization, loss and metrics
    model.compile(optimizer='adam',
                  loss='mean_squared_error',
                  metrics=['mean_absolute_error'])
    return model

Let’s start the training process!

# Training Data
model = get_compiled_model()
model.fit(train_dataset, epochs=20)

Then we have:

For the last step, it’s time to take our model and test it against the test dataset to see if it worked correctly. Try making modifications to the model to see if you can get better prediction results.

# Testing Data
results = model.evaluate(test_dataset)

Then we have the result:
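The exact numbers will vary from run to run. One small tip: evaluate returns the loss followed by each metric in the order listed in model.compile, so you can unpack and print them for easier comparison with the optimized models later:

# results is [loss, mean_absolute_error], matching model.compile
loss, mae = results
print('loss: {:.4f}, mean_absolute_error: {:.4f}'.format(loss, mae))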

Part VI: Model Optimization with Gradient Descent

Let’s optimize the model and see how its performance improves, judged by both the loss and the mean absolute error! First, we will implement a gradient descent optimizer. Feel free to play around with the learning rate to see how the model’s performance changes!

## implement with Gradient Descent Optimizer
## set up the learning rate as 0.02
sgd = tf.compat.v1.train.GradientDescentOptimizer(learning_rate=0.02, use_locking=False, name='GradientDescent')
# constructing the optimizer alone does not change the model;
# it takes effect once the model is recompiled with it
model.compile(optimizer=sgd, loss='mean_squared_error', metrics=['mean_absolute_error'])

Secondly, we will train the model, setting the number of epochs and the validation frequency. Note that batch_size should not be passed to fit here: Keras raises an error when it is specified for a tf.data.Dataset input, and our batching was already set with .batch(1) earlier.

model_version1 = model.fit(
    train_dataset, # already batched with .batch(1), so no batch_size here
    epochs=20,
    validation_freq=20,
)

Then we have:

Let’s evaluate our model after optimization!

# model result after optimization
# the lower the loss and the lower the mean absolute error, the better the model's predictions
result_version1 = model.evaluate(test_dataset)

Then we have:

Since both the loss and the mean absolute error are lower than those of the initial model, we have improved model performance!

Part VII: Model Optimization with Heat Map Feature Selection

Let’s define the target variable and the non-target variables as separate DataFrames:

target_v = df_t6['userRating']
non_target = df_t2[['genre', 'movieTitle']]

Let’s look at our data:

target_v

Then we have:

Let’s check out another one:

non_target

Then we have:

We will extract three columns from the DataFrame:

three_target = df_t2[['genre', 'movieTitle', 'userRating']]
three_target

Then we have:

Let’s create a heat map to select features by correlation coefficient:

## feature selection through correlation coefficient
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
cor = df_t6.corr()
plt.figure(figsize=(20, 10))
sns.heatmap(cor, annot=True)

Then we have:

Let’s select the top 9 features with the highest correlation coefficients; a programmatic alternative is sketched below.

## select the top 9 features
feature_2 = ['War', 'Romance', 'Crime', 'Documentary', 'Fantasy', 'Film-Noir', 'IMAX', 'Animation', 'Adventure']
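If you’d rather not read the coefficients off the heat map by eye, here is a minimal sketch of how a similar selection could be done programmatically, assuming df_t6 and the features list from Part IV are still in scope:

# Rank genres by the absolute value of their correlation with
# userRating, then keep the nine strongest (a sketch, not the
# method used above, which reads the heat map directly)
corr_with_target = df_t6[features].corrwith(df_t6['userRating']).abs()
top9 = corr_with_target.sort_values(ascending=False).head(9).index.tolist()
print(top9)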

Let’s set up the data for TensorFlow:

dataset = tf.data.Dataset.from_tensor_slices((
    tf.cast(df_t6[feature_2].values, tf.float64),
    tf.cast(df_t6['userRating'].values, tf.float64)
))
tf.keras.backend.set_floatx('float64')

Feel free to check out the data using the same steps we used previously.

Let’s prepare the datasets for training and testing, respectively.

dataset = dataset.shuffle(len(df_t6)).batch(1)
train_dataset = dataset.take(int(len(df_t6)*.8))
test_dataset = dataset.skip(int(len(df_t6)*.8))
print(len(list(train_dataset)))

Then we have:

7403

Let’s set up the layer sizes and define which optimizer to use:

def get_compiled_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(8, activation='relu'), # layer 1
        # tf.keras.layers.Dense(20, activation='relu'), # layer 2
        tf.keras.layers.Dense(1, activation='sigmoid') # out
    ])
    # training operations for optimization, loss and metrics
    model.compile(optimizer='adam',
                  loss='mean_squared_error',
                  metrics=['mean_absolute_error'])
    return model

Let’s train the model!

model = get_compiled_model()
model.fit(train_dataset, epochs=20)

Then we have:

Let’s check out the result!

result_version2 = model.evaluate(test_dataset)

Then we have:

1851/1851 [==============================] - 2s 962us/step - loss: 0.0108 - mean_absolute_error: 0.0792

After comparing the two kinds of model optimization, we select the Gradient Descent optimization, since its loss of 7.7983e-04 and mean absolute error of 0.0148 are lower than those of the heat map feature selection optimization on both counts.
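Since all three evaluate calls saved their results (results, result_version1, and result_version2), a quick side-by-side print makes the comparison explicit:

# Compare the three runs; each result is [loss, mean_absolute_error]
for name, res in [('baseline', results),
                  ('gradient descent', result_version1),
                  ('heat map features', result_version2)]:
    print('{}: loss={:.6f}, mae={:.4f}'.format(name, res[0], res[1]))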

Congrats! We’re done! If you like this content, let me know via claps and I will add more blogs like this in the future. Don’t hesitate to reach out to me via TigerGraph’s community forum or Developer Chat! [TigerGraph (discord.com)]
