Graph Neural Network training with Feast

An application of a Feature Storage in a GNN training pipeline

Valerio Piccioni
LARUS
10 min read · Mar 31, 2022


Graph Neural Networks (GNNs) are a class of deep learning algorithms designed to work on graph data, where they achieve state-of-the-art performance on structures such as social networks, molecules and transaction networks. Many people erroneously believe that the training phase is enough to bridge the gap between development and production of a machine learning model, but this is not the case. MLOps (Machine Learning Operations) was born precisely to fill this gap, and in this article we will focus on one of its core components: the feature store. Using a feature store in a training or inference pipeline is not exclusive to GNNs, as it can be applied alongside any deep learning model. But since graph data is unusual territory for both deep learning and feature stores, I think an example of the synergy between these concepts is definitely interesting.

A primer on GNNs

Graph data is complex. Conventional ML and deep learning tools are specialized for simple data types (e.g. sequences and grids), which are usually fixed in size and ordered. Neither holds for graphs: they have no fixed form, just a variable number of unordered nodes, where each node can have a different number of neighbors. This creates dependencies between instances that conventional models do not take into account, and it can be further complicated by concepts such as node heterogeneity. GNNs try to address this problem. The core idea is that a node can be expressed as a function of its neighbors: we determine the computation graph of a node by recursively collecting its direct neighbors, then their neighbors, and so on.

Credits for the image: http://web.stanford.edu/class/cs224w/slides/06-GNN1.pdf

We start by initializing each node's information:
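In the notation of the CS224W slides linked above, this initialization simply sets each node's hidden state to its input feature vector:

```latex
h_v^{(0)} = x_v
```

where $x_v$ is the feature vector of node $v$.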

For every layer in the network the following operations are applied:

  1. Message passing: each node passes its information on to its neighbors
  2. Neighbor message aggregation: each node aggregates the information it received (e.g. sum, mean, etc.)
  3. Neural network module: apply the layer-specific model function (e.g. linear) and an activation

Expressed as a formula, for a graph convolutional network this becomes:
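As a sketch in the CS224W notation (the exact form depends on the GCN variant), a layer update can be written as:

```latex
h_v^{(k+1)} = \sigma\!\left( W_k \sum_{u \in N(v)} \frac{h_u^{(k)}}{|N(v)|} + B_k\, h_v^{(k)} \right)
```

The sum over the neighbors $N(v)$ covers steps 1 and 2 (message passing plus mean aggregation), while the trainable matrices $W_k$, $B_k$ and the nonlinearity $\sigma$ are step 3.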

What is a Feature?

We saw that each node is initialized at step 0 with its feature vector, but what exactly is a feature? It is any individual measurable attribute of an element in the graph (node or edge). For example, in a network of people such as a social network, where each node is a person, a single feature can be one of that person's attributes, such as age, height or sex. The overall information of an element is defined by its feature set (or vector). Features are transformed and updated at each layer during the training phase of a neural network, and these updated features, the output of some models, can also be used as the input of other models (e.g. embeddings).

Feature Storage

Managing features on a large scale can be hard. To better manage these complexities, feature stores have been designed.

Credits for the image:https://www.tecton.ai/blog/what-is-a-feature-store/

Their primary goals are:

  • Store and manage features
  • Handle time versioning of features
  • Handle feature retrieval in both the training and inference phases
  • Transform raw data through the execution of data pipelines (optional)
  • Search and discover useful features in data (optional)

There are several feature stores on the market, the likes of Amazon SageMaker Feature Store, Databricks Feature Store, Tecton and so on, but too few of them are open source. The one I want to talk about in this article is Feast. Feast (Feature Store) is a relatively new (version 0.19 at the time of writing) data system for managing and serving features to machine learning models in production. Feast can serve feature data to models from a low-latency online store (for real-time prediction) or from an offline store (for scale-out batch scoring or model training). Some may find it difficult to differentiate Feast from other better-known tools, so before going into details, here is a simple table clarifying what it does, what it does not aim to do, and what it partially does or will do in future updates:

Feast logical architecture and concepts

Credits for the image: https://docs.feast.dev/getting-started/architecture-and-components/overview

The image above shows the logical architecture of Feast: as new features arrive, they are saved in the Offline Store. We will later see that the Feast SDK can be used to retrieve features from both stores. This is made possible by defining logical structures that are stored in the Registry (the "Feast Apply" arrow in the image). The best features can be uploaded from the Offline Store to the Online Store via the feast CLI (a process called materialization) and made available for the inference phase of the model in production. To better analyze this architecture, we first need to understand the core concepts behind it.

  • Data source: the raw underlying data (e.g. a table in a SQL db)
  • Point-in-time data: Feast uses a time-series model to represent data (i.e. it requires an event timestamp column in the data source)
  • Entity: a collection of semantically related features that maps the domain of a use case. It usually corresponds to the primary-key column(s) of the data source
  • Feature view: an object that represents a logical group of time-series feature data as it is found in a data source

Suppose we have a SQL table containing some driver features as a data source; a feature view can then be seen as a specific time-related subset of it:

Credits for the image: https://docs.feast.dev/getting-started/concepts/data-source

Now that we understand the above concepts, we can go back to the Offline Store. It is usually a cloud SQL data warehouse like BigQuery, Redshift or Snowflake, where historic time-series features are saved. This effectively constrains the Offline Store to be a SQL database, as the interfaces are modeled on the data warehouses mentioned above. Feast does not manage the Offline Store directly; instead, it uses the core concept definitions in the Registry as an interface for querying existing features. Feast can query only a single Offline Store at a time. The Online Store, on the other hand, is usually a low-latency key-value lookup database, and it is managed directly by Feast. In the case of real-time prediction, features can be ingested by the Online Store directly from a stream source, without materializing them from the Offline Store.

Feast production architecture. Credits for image:https://docs.feast.dev/how-to-guides/running-feast-in-production

A practical example: setup

After this general overview, let's go into more detail, starting with some setup before looking at the code. First of all we need an example dataset, and in this case we will use DBLP, a computer science bibliography website. We adopt a subset of it inspired by the paper MAGNN: Metapath Aggregated Graph Neural Network for Heterogeneous Graph Embedding, containing 14475 authors, 14376 papers, 8920 terms and 20 publication venues after data preprocessing. The papers are divided into four research areas (Database, Data Mining, Artificial Intelligence and Information Retrieval). The dataset is stored in a Neo4j database.

The dataset graph schema

Our goal will be to classify the Paper nodes into their respective research areas. This dataset is pretty easy, as models can achieve really good results with just a hundred training nodes. I mention this because in this article we will not focus on performance, but just show how Feast can be integrated into a GNN training pipeline. For this example we will use a Postgres instance for the Offline Store and Registry, with the TimescaleDB extension for better time-series management, accessed through the Postgres community plugin for Feast. In our instance we have some precomputed features for each node, produced with the fastRP and GraphSAGE algorithms (created with the GDS library of Neo4j).

A glimpse of the author table in the Offline Store

Each node category translates directly into a table where each feature is represented by a column. This dataset does not have edge features, but if it did, we could have created the corresponding table, like an N:N relationship in an E-R diagram, as the data source for the edge type. Feast also supports point-in-time joins to retrieve valid features from multiple views together (but this is very slow if we map graphs this way, so in my opinion it is still better to get each element's features one category at a time). Now that everything is up and we have installed the feast SDK, we can initialize our feature store repository. We can do that via the CLI with the following command:
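For example (`feature_repo` is just an illustrative name; any name works):

```shell
feast init feature_repo
cd feature_repo
```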

This command initializes a folder with some default data. The most important file in this folder is the yaml configuration:

In this file we configure where the SDK can find the Registry, the Offline Store and the Online Store, and which provider we will use (local, AWS or GCP). From the snippet above you can see that there is no definition of the Offline Store at all, because by default it uses local parquet files, together with SQLite files for the Registry and the Online Store. The yaml configuration for this article's demo will instead use Postgres for both the Registry and the Offline Store, and Redis for the Online Store:
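A `feature_store.yaml` along these lines could be used. This is a sketch: hosts, credentials and database names are placeholders, and the exact type names for the Postgres registry and offline store come from the community plugin, so double-check them against that plugin's documentation:

```yaml
project: dblp_gnn
provider: local
registry:
  # Registry store class provided by the feast-postgres community plugin
  registry_store_type: feast_postgres.PostgreSQLRegistryStore
  path: feast_registry
offline_store:
  type: feast_postgres.PostgreSQLOfflineStore
  host: localhost
  port: 5432
  database: feast
  db_schema: public
  user: feast
  password: feast
online_store:
  type: redis
  connection_string: localhost:6379
```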

A practical example: code

Now that we have defined the setup, we can look at the training pipeline code. To build the GNN model we will use the DGL library with the PyTorch backend. First we need to create the DGL graph by retrieving the topology from Neo4j with Cypher. We need to add reverse relationships so that the Paper nodes have incoming edges for the message passing framework.

We are using a custom Neo4j driver that uses an Arrow Flight server to retrieve data from Neo4j (https://github.com/neo4j-field/neo4j-arrow), but the same thing can be achieved with the official Neo4j Python driver.

The printed graph will look like this:

The next step is creating the feature definitions with Python objects from the feast SDK.

The ttl parameter represents the time validity of the feature view

After saving all the objects in the Registry with the apply method, we can use the CLI to retrieve information about them:
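For example:

```shell
feast entities list
feast feature-views list
```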

Now that we have values in the Registry we can get our features for training. In order to do that we use the get_historical_features method:

Here you can see that entity_df is actually a SQL query: the Postgres plugin lets you decide whether to retrieve features with a pandas dataframe of entity rows or with a SQL query. I think the query gives more flexibility, as you can add a WHERE clause; here, for example, I decided to restrict the features to retrieve from the last 7 days (as in the ttl) to the last 3 days.

Here we split the dataset into train, validation and test sets:
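The split can be sketched with boolean masks over the Paper node ids (the split sizes here are made up; the article only says a hundred training nodes are enough):

```python
import torch

def split_masks(num_nodes, train_size=400, val_size=400, seed=42):
    """Randomly split node ids into boolean train/val/test masks."""
    g = torch.Generator().manual_seed(seed)
    perm = torch.randperm(num_nodes, generator=g)
    train_mask = torch.zeros(num_nodes, dtype=torch.bool)
    val_mask = torch.zeros(num_nodes, dtype=torch.bool)
    test_mask = torch.zeros(num_nodes, dtype=torch.bool)
    train_mask[perm[:train_size]] = True
    val_mask[perm[train_size:train_size + val_size]] = True
    test_mask[perm[train_size + val_size:]] = True
    return train_mask, val_mask, test_mask

# One mask entry per Paper node (14376 papers in the dataset).
train_mask, val_mask, test_mask = split_masks(14376)
```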

The next step is the GNN model definition. We will use a standard 2-layer heterogeneous convolutional network:

And finally we train for 100 epochs and save the best accuracy value:
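The loop can be sketched as follows (assuming a heterogeneous model like the one above, whose output is a dict keyed by node type; the "Paper" key and hyperparameters are illustrative):

```python
import torch
import torch.nn.functional as F

def train(model, graph, features, labels, train_mask, val_mask,
          epochs=100, lr=1e-2):
    """Train on the Paper nodes and keep the best validation accuracy."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    best_val_acc = 0.0
    for _ in range(epochs):
        model.train()
        logits = model(graph, features)["Paper"]
        loss = F.cross_entropy(logits[train_mask], labels[train_mask])
        opt.zero_grad()
        loss.backward()
        opt.step()
        # Evaluate on the validation split without tracking gradients.
        model.eval()
        with torch.no_grad():
            pred = model(graph, features)["Paper"].argmax(dim=1)
            val_acc = (pred[val_mask] == labels[val_mask]).float().mean().item()
        best_val_acc = max(best_val_acc, val_acc)
    return best_val_acc
```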

Now all we have to do is repeat the previous steps to retrieve the GraphSAGE features and train a new model, so we can choose the best overall features to materialize:

Here the command is materialize-incremental, which saves to the Online Store all the feature views valid up to a given date. There is also the materialize command, which takes a start date and an end date as the validity interval.
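For example, to materialize everything valid up to the current time:

```shell
feast materialize-incremental $(date -u +"%Y-%m-%dT%H:%M:%S")
```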

Here is the materialization output:

We can see that fastRP performed better than GraphSAGE, probably because on this dataset topology mattered more than node attributes.

Now that we have the best model and the best features stored in the Online Store, we can start the inference phase. To do so we will use the get_online_features method.

and this is the final output:

Conclusions and Considerations

As stated before, this dataset is very easy, and our focus here was not on GNN performance. As more and more datasets with many features are taken into consideration, a tool that can manage them becomes necessary. I'm a little disappointed that, given the Offline Store interface, a SQL database is somehow mandatory, but since many companies pay for a cloud SQL data warehouse I can understand this choice. There are some interesting items on the Feast roadmap right now (like a Web UI and a Java client), and I think the tool is evolving in the right direction as more features arrive. I would like to thank Riccardo Corò for his help in writing this article.
