Serverless and Recurrent Neural Networks with Fn, GraphPipe and TensorFlow

Capturing Time (Image © Ralf Mueller)

The last article First steps in serverless with fnproject.io marked the start of my journey into serverless computing. My first proof of concept in this area was quite promising, so I have decided to continue on this path and do a couple more experiments. I have a set of use cases in mind where serverless architectures might be beneficial for certain integration scenarios that involve Systems, People and Developers.

Overview

In this article I’m going to explore the use of modern Machine Learning and AI techniques in the context of serverless computing. I’m putting together an example that does the following:

  • Function will be invoked with a cloudevents.io conforming event. The vigilant reader might notice that I’ve been using CloudEvents in my previous example. This is not by accident; I’m envisioning an architecture that is based on standards, and CloudEvents seems a natural choice here for multiple reasons: it is part of the Cloud Native Computing Foundation (although in Sandbox status at the time of this writing), it’s a simple but extensible data format, etc.
  • Function will extract the Data portion of the CloudEvent and then call into a Machine Learning model for scoring.
  • Function will create a CloudEvent based response with the result of scoring against the Machine Learning Model.

As with my previous article, this is a very simple and contained use case. However it should give some ideas on what can be done in a larger context. Also, since I’m still a newbie in both the Go programming language and serverless, I’d like to keep the examples as small and simple as possible for the moment.

For the use case itself I picked something in the area of Time Series analysis, more precisely forecasting time series data. A time series is a series of data points indexed in time order. Time series analysis comprises methods of analyzing time series data to extract meaningful information. Time series are quite powerful and well known; there is a vast amount of literature and open-source tooling to deal with time series data and analysis. Time series are used in a variety of domains including:

  • Analysis of multi-sensor networks (aircraft, nuclear power plants, manufacturing systems, etc.).
  • Forecasting of financial data: Stock, Mortgage, Utility price, etc.
  • Analysis and prediction of complex IT systems. One prominent example here is Prometheus, which internally uses a time series database to store metrics reported by software systems. Prometheus is also part of the Cloud Native Computing Foundation. It offers only very basic time series analysis though, mostly range queries and some form of linear regression for forecasting.

Up until recently, the methods used for time series analysis and forecasting were purely statistical.

With the availability of modern Machine Learning frameworks like TensorFlow, Keras and others, it has almost become common practice to predict time series data using Recurrent Neural Networks (RNNs) or Long Short-Term Memory networks (LSTMs). I’m not going to go too deep here into the discussion of which technique should be used for which use cases. For this article I just wanted to pick an example with a broad range of use cases that isn’t too complex but also isn’t a “Hello Neural Network” kind of thing.

Machine Learning Environment

The tech stack for this article is a bit more involved than in my earlier post and includes the following:

  • Anaconda with Python 3.x, a rich set of Python and Machine Learning libraries, and Jupyter notebook. I’m not going into the details of how to install and configure Anaconda; there is already plenty of material on this subject. You can download Anaconda here.
  • Jupyter Notebook. I’m a big fan of notebooks and for this example I’m going to use Jupyter notebook, which is quite popular in the Python world. In a future article involving Machine Learning, Graph or Databases I will switch to Oracle Data Labs Studio though since it offers a much richer notebook experience. Make sure to watch this Interactive Data Analytics and Visualization with Collaborative Documents video that my Oracle Labs colleagues put together.
  • TensorFlow. For this article we’re going to use TensorFlow as our ML implementation of choice. In subsequent articles I’m going to use a variety of ML and AI libraries, both open-source and Oracle-specific ones.
  • GraphPipe. This is a protocol and collection of software designed to simplify machine learning model deployment and decouple it from framework-specific model implementations. GraphPipe was recently open-sourced by Oracle and can be cloned from https://github.com/oracle/graphpipe.

For the serverless infrastructure, I’m using Fn Project. Make sure to check my previous article on how to setup a complete Fn environment with docker compose.

Setup

I’m not going into the details of creating a development environment for this example. However, I’d like to note a few traps and pitfalls that I ran into while putting all the bits and pieces in place. First and foremost, it is important to get the correct libraries and versions. I had a hard time and ran into a couple of environment issues due to incompatibilities of libraries in my Anaconda setup. If you haven’t done so already, you might want to install the following packages into your Anaconda environment:

shell commands to install conda packages
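A one-liner along these lines should do it (the package names are my assumption; conda will resolve compatible versions for your environment):

```shell
# install TensorFlow and the scientific Python stack into the active Anaconda environment
conda install tensorflow pandas numpy matplotlib
```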

This will install the appropriate libraries into your Anaconda environment. In my environment, I’m using the following library versions:

TensorFlow: 1.10.0
Pandas: 0.23.1
Numpy: 1.15.1
Matplotlib: 2.2.3

The problem I had initially was that when importing TensorFlow into my Jupyter notebook, I got tons of exceptions with no clear indication of what was wrong. With some research, I figured out that the dask library was too old. This was easy enough to fix.

> conda update dask

This caused an update of pandas as well, and after this, things started working. You might not run into this with a brand new install of Anaconda, but if you have had Anaconda installed for quite some time, it is worth updating certain libraries to prevent some nasty issues.

GraphPipe

GraphPipe is a new development by Oracle which was open-sourced recently.

GraphPipe Architecture

GraphPipe is NOT a new Machine Learning framework. GraphPipe is a machine learning model serving protocol and specification. It provides simple and efficient reference model servers for serving ML models from TensorFlow, Caffe2 and ONNX, and a specification based on flatbuffers for ultra-fast communication between a client and the ML framework of choice. Client implementations exist for Go, Python and Java.

The GraphPipe Specification defines a thin protocol for Tensors (multi-dimensional array of data with a specific shape and type) based on flatbuffers for the following:

  • Request and Response of the GraphPipe Server
  • Metadata Request and Response

The protocol is intentionally kept simple so that new GraphPipe server implementations can easily be developed. A single GraphPipe server comes in the form of a Docker image and serves a single model. This is in contrast to some generic ML services with a more involved, heavyweight input/output contract, manageability requirements, etc.

In this regard I consider GraphPipe a good fit for a serverless architecture: it comes with a very efficient communication protocol based on flatbuffers, and it is compact and contained enough to serve the short-lived serverless paradigm well.

Modeling the Time Series Use Case

Let’s get our fingers on the keyboard and do some work. You might want to start a new Jupyter notebook for this and use Python 3 as the implementation language.

Import required libraries

We start our notebook by importing all the required libraries.

Import of required packages
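Such an import cell might look like this (the exact imports depend on the notebook; these cover the libraries and versions listed above):

```python
# core scientific stack plus TensorFlow (the TF 1.x API is used throughout this article)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
```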

Generate some test data

For this article we are going to generate random time series data. In a later article I will show how to use real time series data for some selected use cases.

Create some random data and plot it
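One way to generate such data — an upward trend plus some seasonality plus noise, indexed by a monthly date range; the exact recipe in the notebook may differ — is:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(42)  # make the "random" data reproducible

# monthly index starting in 2000: trend + seasonal swing + noise
rng = pd.date_range(start="2000-01-01", periods=220, freq="M")
values = (np.linspace(0, 10, len(rng))             # upward trend
          + 5 * np.sin(np.arange(len(rng)) / 6.0)  # seasonality
          + np.random.normal(0, 1, len(rng)))      # noise
ts = pd.Series(values, index=rng)

ts.plot(title="Sample Time Series Data")
plt.show()
```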

This will show a plot that might look something like this:

Sample Time Series Data

The situation shown here might be a typical IT resource utilization curve; we can clearly see an upward trend starting around 2007, and by forecasting the curve we might react to an eventual over-utilization before it happens.

Prepare Data for ML

Next we’re going to create an array from the time series data, set up the x and y values and the batches for x and y, and then split the data into training and test data.

Prepare data for RNN and split into training and test data
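The reshaping could be sketched like this, assuming a window length of 20 (the input shape the model later expects); the variable names x_batches, y_batches and X_test follow the text, the rest are my assumptions:

```python
import numpy as np

num_periods = 20   # window length; the model input shape is (batch, 20, 1)
f_horizon = 1      # forecast one step ahead

series = np.random.randn(220)  # stand-in for the generated time series values

# hold back the last window (plus the forecast horizon) as test data
test_size = num_periods + f_horizon
train = series[:-test_size]

# x: windows of 20 values, y: the same windows shifted by one step
samples = (len(train) - f_horizon) // num_periods
x_data = train[:samples * num_periods]
y_data = train[f_horizon:samples * num_periods + f_horizon]
x_batches = x_data.reshape(-1, num_periods, 1)
y_batches = y_data.reshape(-1, num_periods, 1)

# test set: the final values give one input window and one shifted target window
X_test = series[-test_size:-f_horizon].reshape(-1, num_periods, 1)
Y_test = series[-num_periods:].reshape(-1, num_periods, 1)
```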

Creating the RNN in TensorFlow

We’re going to set up TensorFlow with a basic RNN cell, then set up a dynamic RNN and use the Adam optimizer. Since this is a regression use case, we’re using an MSE loss function and a Rectified Linear Unit (ReLU) as the activation function for the RNN cell.

Prepare RNN, Optimizer and Loss function
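In the TF 1.x API (the article uses TensorFlow 1.10) this graph definition might look roughly as follows; hidden and learning_rate are taken from the text, the remaining names are my assumptions:

```python
import tensorflow as tf

inputs = 1          # one feature per time step
hidden = 100        # number of RNN units (the `hidden` variable mentioned later)
output = 1
num_periods = 20
learning_rate = 0.001

X = tf.placeholder(tf.float32, [None, num_periods, inputs], name="input")
y = tf.placeholder(tf.float32, [None, num_periods, output])

# basic RNN cell with ReLU activation, unrolled dynamically over the window
cell = tf.nn.rnn_cell.BasicRNNCell(num_units=hidden, activation=tf.nn.relu)
rnn_output, _ = tf.nn.dynamic_rnn(cell, X, dtype=tf.float32)

# project the hidden state of every time step down to a single output value
stacked = tf.reshape(rnn_output, [-1, hidden])
stacked_outputs = tf.layers.dense(stacked, output)
outputs = tf.reshape(stacked_outputs, [-1, num_periods, output], name="output")

# MSE loss with the Adam optimizer, as described above
loss = tf.reduce_mean(tf.square(outputs - y))
training_op = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)
```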

Training the RNN model

We’re now going to train the RNN model. To recap what we have done so far and what the plan is, see the following picture.

Illustration of RNN Model Training and Testing

We split the time series data into two pieces: one piece outside of the red rectangle (data from 2000 to roughly mid 2016) and the other piece inside the red rectangle. We’re going to use the data outside of the red rectangle for model training, and then we can test the model predictions against the data in the red rectangle in the left figure. The right figure is a visualization of Actual vs. Forecast, where the blue dots represent the actual data (compare this to the curve in the red rectangle on the left; it is identical except it’s on a different scale). The predicted values from the RNN model are represented by the red dots and actually come quite close for a first round of training.

But first things first, let’s continue with the training code for the RNN model. We start a new TensorFlow session and iterate over the number of epochs, training the RNN on the data stored in x_batches and y_batches.

At the end of the loop we run the test of the RNN model by predicting the value given the input X_test.

Finally, we’re going to store the model on the local file system so that we can consume it with the GraphPipe server for TensorFlow.

Train the RNN model, predict using the test data and save the model to local file system
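Assuming the graph tensors from the previous step (X, y, outputs, training_op, loss — these names are my assumption), the training and export loop might look like this under the TF 1.x API; convert_variables_to_constants freezes the variables so graphpipe-tf can serve the resulting .pb file:

```python
import os
import tensorflow as tf

epochs = 1000

init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    for ep in range(epochs):
        sess.run(training_op, feed_dict={X: x_batches, y: y_batches})
        if ep % 100 == 0:
            mse = sess.run(loss, feed_dict={X: x_batches, y: y_batches})
            print(ep, "\tMSE:", mse)

    # predict on the held-back test window
    y_pred = sess.run(outputs, feed_dict={X: X_test})

    # freeze the graph and save it as a .pb file for graphpipe-tf
    # ("output" is the name given to the output tensor above)
    os.makedirs("models", exist_ok=True)
    graph_def = tf.graph_util.convert_variables_to_constants(
        sess, sess.graph_def, ["output"])
    with tf.gfile.GFile("models/rnn_ts_model.pb", "wb") as f:
        f.write(graph_def.SerializeToString())
```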

While we run the epochs, we can monitor the progress of the training by printing the value of the Mean Square Error to the console. A typical output might look like this:

Decreasing MSE by increasing number of epochs

The output shows a decreasing value for the Mean Square Error with increasing number of epochs. This shows that the RNN is improving with each epoch.

Testing the RNN model

Now that we have trained the model, we can test it by using the data we split from the original time series data.

Plot the comparison Actual vs. Forecast
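The comparison plot boils down to a few matplotlib calls. In the notebook you would plot Y_test and the y_pred returned by the session run; here random stand-in arrays of the same shape keep the sketch self-contained:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# stand-ins for the actual and predicted test windows, shape (1, 20, 1)
Y_test = np.random.randn(1, 20, 1)
y_pred = Y_test + np.random.normal(0, 0.3, Y_test.shape)

plt.title("Actual vs. Forecast", fontsize=14)
plt.plot(np.ravel(Y_test), "bo", markersize=8, label="Actual")
plt.plot(np.ravel(y_pred), "r.", markersize=8, label="Forecast")
plt.legend(loc="upper left")
plt.xlabel("Time Periods")
plt.savefig("actual_vs_forecast.png")
```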

The output might look something like this. Please note that the blue-dotted curve is a piece of the actual data from the initial time series data set, namely the piece in the red rectangle from the figure above. The red-dotted curve is the forecast, which comes quite close to the actual curve.

Test of the RNN model by comparing Actual vs. Forecast

This isn’t bad for a first test. One might further improve the RNN by running more epochs and/or increasing the number of (hidden) RNN units (variable hidden in the Jupyter notebook). Another parameter to play with is the learning_rate of the Adam optimizer. In our example we selected a slow learning rate of 0.001.

Starting a GraphPipe-TF Server with the RNN model

Now that we have a model created and saved in the correct format, we can start a GraphPipe server to expose the RNN model for predictions. Please note that we need to put the file rnn_ts_model.pb on a volume mount that is accessible by Docker. Check the example below, which mounts the local directory $PWD/models to /models in the Docker image. It is expected that rnn_ts_model.pb exists in the directory $PWD/models on your host machine.

docker run -it --rm \
-v "$(pwd)/models:/models/" \
-p 9000:9000 \
sleepsonthefloor/graphpipe-tf:cpu \
--model=/models/rnn_ts_model.pb \
--listen=0.0.0.0:9000

That’s it! We have successfully trained a Recurrent Neural Network in a few lines of Python code using TensorFlow and exposed it via GraphPipe Server for predictions!

Implementing the Fn function

Let’s continue with implementing the Fn function for our use case.

Create the Shell Go Fn code and deployment descriptors

To begin with, we’re initializing a Fn app directory with a Go runtime and use HTTP as a trigger.

fn init --runtime go --trigger http gpfn

This will create a new gpfn directory that contains some Go function boilerplate and the configuration file required to deploy the function.

Import required Go libraries

First we need to import some Go libraries. For this you need to install the Fn Go FDK and the GraphPipe Go client library into your Go environment as follows.

> go get github.com/oracle/graphpipe-go
> go get github.com/fnproject/fdk-go
> go get github.com/fnproject/cloudevent

Next, the Go function (func.go) should import the required libraries.

Go main package and imports for Fn and GraphPipe
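A sketch of the func.go skeleton, assuming the Fn Go FDK’s Handle/HandlerFunc API (the cloudevent and graphpipe-go imports come into play once the handler is implemented in the next section):

```go
package main

import (
	"context"
	"io"

	fdk "github.com/fnproject/fdk-go"
)

// main wires our handler into the Fn Go FDK event loop.
func main() {
	fdk.Handle(fdk.HandlerFunc(myHandler))
}

// myHandler is implemented in the next section.
func myHandler(ctx context.Context, in io.Reader, out io.Writer) {}
```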

Adding the function handler

As with our previous example, we’re going to add the Fn function handler as follows.

Implementing the Handler

Next, we’re going to implement the handler as a three-step process.

  1. Convert the CloudEvent data into something that we can send to the GraphPipe Server.
  2. Call into the GraphPipe Server that serves our RNN time series forecasting model.
  3. Get the result from GraphPipe Server and transform it into a CloudEvent.
Fn function handler with CloudEvent handling and GraphPipe Server invocation.
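The three steps above can be sketched as follows, assuming the package and imports from the previous section plus encoding/json, os and graphpipe-go. The Event struct, its field names (modeled on the CloudEvents v0.1 JSON format) and the response event naming are my assumptions; graphpipe.Remote is the convenience call from the GraphPipe Go client:

```go
// Event is a minimal CloudEvents v0.1 envelope.
type Event struct {
	CloudEventsVersion string        `json:"cloudEventsVersion"`
	EventType          string        `json:"eventType"`
	EventID            string        `json:"eventID"`
	Source             string        `json:"source"`
	ContentType        string        `json:"contentType"`
	Data               [][][]float32 `json:"data"`
}

func myHandler(ctx context.Context, in io.Reader, out io.Writer) {
	// 1. Decode the incoming CloudEvent and extract the Data tensor.
	var evt Event
	if err := json.NewDecoder(in).Decode(&evt); err != nil {
		io.WriteString(out, `{"error":"could not decode CloudEvent"}`)
		return
	}

	// 2. Score the data against the RNN model served by graphpipe-tf.
	pred, err := graphpipe.Remote(os.Getenv("GP_SERVER_URL"), evt.Data)
	if err != nil {
		io.WriteString(out, `{"error":"`+err.Error()+`"}`)
		return
	}

	// 3. Wrap the prediction in a response CloudEvent.
	json.NewEncoder(out).Encode(map[string]interface{}{
		"cloudEventsVersion": evt.CloudEventsVersion,
		"eventType":          evt.EventType + ".response", // naming is an assumption
		"eventID":            evt.EventID,
		"source":             "/gpfn",
		"contentType":        "application/json",
		"data":               pred,
	})
}
```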

Tweaking Gopkg.toml

In this example, we need a slight modification of the default Gopkg.toml file that was created when we bootstrapped the function. We need to specify that we want to use the GraphPipe code from the master branch by adding the following.

[[constraint]]
branch = "master"
name = "github.com/oracle/graphpipe-go"

Fn deployment file

Our func.yaml is untouched and quite simple.
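It should look roughly like this (the schema version and trigger names depend on your Fn CLI version and the fn init invocation above):

```yaml
schema_version: 20180708
name: gpfn
version: 0.0.1
runtime: go
entrypoint: ./func
triggers:
- name: gpfn-trigger
  type: http
  source: /gpfn-trigger
```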

Deploying the function to an Fn Server

Finally, we can deploy the function to an Fn Server and add a trigger to be able to invoke it.

> fn deploy \
--app gpfnapp \
--registry phx.ocir.io/oicpaas1/ralmuell/fn

You can now list the triggers and their corresponding HTTP endpoints of the ‘gpfnapp’ app. Just issue the command below

> fn list triggers gpfnapp

which should give something like this…

FUNCTION  NAME          TYPE  SOURCE         ENDPOINT
gpfn      gpfn-trigger  http  /gpfn-trigger  http://localhost:8080/t/gpfnapp/gpfn-trigger

Last but not least, we need to add a config value for the variable GP_SERVER_URL that we’re using in our Fn Go code. This config variable should be set to the URL of the GraphPipe server.

> fn config function \
    gpfnapp gpfn GP_SERVER_URL http://localhost:9000

We can check the config with the following command…

> fn list config function gpfnapp gpfn

and it should produce something like …

KEY           VALUE
GP_SERVER_URL http://localhost:9000

Testing the Function

Now that we have everything together, let’s test the function. For this we’re going to create a CloudEvent and store it in a file, e.g. ce.json. Please note that our RNN model was created in a way that it expects an input of shape (1,20,1) or multiples of 20. So for the Data portion we have to pass a three-dimensional array here.

Sample Input Cloud Event
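A ce.json along these lines fits the bill; the envelope follows the CloudEvents v0.1 JSON format, the eventType, eventID and source values are made up, and the data attribute carries one 20-step window of shape (1,20,1):

```json
{
  "cloudEventsVersion": "0.1",
  "eventType": "com.example.timeseries.forecast",
  "eventID": "C1234-5678",
  "source": "/timeseries/demo",
  "contentType": "application/json",
  "data": [[[1.0], [2.1], [1.9], [3.2], [2.8], [4.1], [3.9], [5.0],
            [4.7], [5.9], [5.6], [6.8], [6.4], [7.7], [7.3], [8.5],
            [8.2], [9.4], [9.1], [10.2]]]
}
```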

To test the function, we can pipe the content of ce.json into the gpfn function using the Fn CLI…

cat ce.json | fn invoke gpfnapp gpfn

The output will be another CloudEvent with the result of the predictions. The numbers should be identical to the red dots in the diagram above.

Outlook

Thanks for reading all the way through to the end! Now that I have a few functions in place, I guess it is time to work on some infrastructure pieces again. In my next article, I’m going to explore and test the following:

  • Bring an Event Hub into place (for example Oracle Event Hub) which consumes and publishes CloudEvent conforming events.
  • Work on a Kafka Consumer Interceptor implementation that takes CloudEvents from a Topic, evaluates the “destination” extension of the event and calls into a Fn function.
  • Explore the world of Flatbuffers and Protocol Buffers a bit more and use them as a protocol between the Fn world and some core complementing services (e.g. DMN, Machine Learning, etc.). I see this as a promising architecture and a good match between the serverless and microservices worlds, where a set of core microservices could be consumed by functions with an ultra-fast, low-latency protocol in between. This definitely needs further investigation.

There is so much more to do and investigate, interesting times ahead of us!

Acknowledgement

This article wouldn’t have been possible without the help of Anthony Young and Vish (Ishaya) Abrams from the GraphPipe team who helped me in getting the Python code correct and the RNN model saved for consumption by graphpipe-tf server. I’m really thankful for their responsiveness and their support! I’d also like to thank Chad Arimura and his team from the Fn Project Team for their support on all things Fn related.

References