E2E Kubeflow Pipeline for time-series forecast — Part 2

Rodrigo Pereira
Dec 19, 2019 · 4 min read

This is the second and final post on how to build an end-to-end (e2e) pipeline for time-series forecasting using Kubeflow.
Feel free to read the first post if you haven’t yet.
It covers the motivation for solving such a problem, a bit of exploratory data analysis (EDA), and the technical aspects of Kubeflow.

This post presents an overview of each Kubeflow pipeline component designed for the proposed solution.
Feel free to take a look at the source code.

Pipeline components

Kubeflow provides a Python DSL in which each pipeline component is defined as a ContainerOp; in a nutshell, each component runs as a Docker container.
When you define a ContainerOp, you must provide:

  • a container image URL
  • an entry command
  • script arguments
  • file outputs (if any), used to do I/O across components
  • output artifact paths, used to export metrics or HTML widgets
Figure 1: Source code of pipeline definition.

Once the pipeline is defined, it can be compiled into a tar.gz file. When uploaded through the Kubeflow UI, the pipeline will look like this:

Figure 2: Designed pipeline for time-series forecast

To run an experiment, you just need to click the Create Run button and enter the pipeline parameters like this:

Figure 3: Kubeflow experiment UI

Now let’s take an overview of each pipeline step.

Preprocessing

The preprocessing components are designed primarily to process the input data stored in BigQuery into TFRecord shards, which are consumed later at the training step.
The preprocessing part of the pipeline is split into the following steps: read-metadata, bq2tfrecord, and data-validation.

The first preprocessing step is responsible for listing both the eligible community areas and the z-norm statistics (mean and standard deviation) for each community. These outputs are fed into the bq2tfrecord step.

This step is built on Apache Beam and can currently run on either the DirectRunner or the DataflowRunner.

The second preprocessing step reads all records from BigQuery and outputs one tf.Example per window of data.

The raw data contains individual taxi ride records, so they need to be aggregated hourly for each community. This aggregation is done in BigQuery, since it is optimized for such a task. The sliding windowing is done in the Apache Beam pipeline, which outputs a time series for each community containing the number of taxi rides for a given N-hour window.
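The sliding-window step can be sketched in plain Python: given one community’s hourly counts, it emits one training pair per window position. Window size and the one-step horizon below are illustrative choices, not the repo’s actual parameters:

```python
def sliding_windows(hourly_counts, window_size, horizon=1):
    """Yield (input_window, target) pairs from an hourly time series.

    hourly_counts: list of ride counts, one per hour, for one community.
    window_size:   number of past hours fed to the model.
    horizon:       how many hours ahead the target lies.
    """
    for start in range(len(hourly_counts) - window_size - horizon + 1):
        window = hourly_counts[start:start + window_size]
        target = hourly_counts[start + window_size + horizon - 1]
        yield window, target

# Example: 6 hourly counts, 3-hour windows, predict the next hour.
pairs = list(sliding_windows([5, 8, 13, 9, 4, 7], window_size=3))
# -> [([5, 8, 13], 9), ([8, 13, 9], 4), ([13, 9, 4], 7)]
```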

Furthermore, TensorFlow Transform is used to apply the z-norm transformation to each community’s time series, both to transform the training data and to save the normalization parameters as lookup tables to be used later at serving time.
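The z-norm itself is simple: each community’s series is shifted by its mean and scaled by its standard deviation, and those two constants are exactly what must be kept around for serving. A plain-Python equivalent of the arithmetic, with made-up numbers:

```python
import math

def znorm(series):
    """Return the z-normalized series plus the (mean, std) used, which
    must be saved so the same transform can be applied at serving time."""
    mean = sum(series) / len(series)
    var = sum((x - mean) ** 2 for x in series) / len(series)
    std = math.sqrt(var)
    normalized = [(x - mean) / std for x in series]
    return normalized, mean, std

normalized, mean, std = znorm([10.0, 20.0, 30.0])
# mean = 20.0, std ~ 8.165; normalized ~ [-1.2247, 0.0, 1.2247]
```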

The community area code is also stored, to be used for embeddings during training. Additionally, along with the number of taxi rides, each time step also stores the hour of the day (0–23), day of the week (0–6), week of the year (0–54), month (0–11), and day of the year (0–365). This way we explicitly provide some contextual information to the model.
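Extracting these calendar features for a single timestamp can be sketched with the standard library, with offsets chosen to match the 0-based ranges above:

```python
from datetime import datetime

def calendar_features(ts: datetime) -> dict:
    """Contextual features for one time step, all 0-based as described above."""
    return {
        'hour_of_day': ts.hour,                     # 0-23
        'day_of_week': ts.weekday(),                # 0-6, Monday = 0
        'week_of_year': ts.isocalendar()[1] - 1,    # 0-based ISO week number
        'month': ts.month - 1,                      # 0-11
        'day_of_year': ts.timetuple().tm_yday - 1,  # 0-365
    }

features = calendar_features(datetime(2019, 12, 19, 14))
# -> {'hour_of_day': 14, 'day_of_week': 3, 'week_of_year': 50,
#     'month': 11, 'day_of_year': 352}
```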

Figure 4: Normalized number of rides for communities #6, #7 and #8

Data Validation

In order to check for anomalies and perform a fast exploratory data analysis, TensorFlow Data Validation (TFDV) was used.

It consumes the TFRecord dataset output by the previous steps to compute descriptive statistics. You can perform the same analysis on the raw data; you just need to output it along with the preprocessed data.

Moreover, it is up to you to code the pipeline to stop if any anomaly occurs, but that was not done here.
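As a minimal stand-in for TFDV’s schema-based checks, such a gate could compare fresh data against the training baseline and fail the step on a mismatch. The checks and thresholds below are arbitrary assumptions, not what TFDV actually validates:

```python
def check_anomalies(series, baseline_mean, baseline_std, max_sigma=6.0):
    """Return a list of anomaly messages; an empty list means the data passes.

    A real pipeline would validate statistics against a TFDV schema; this
    sketch only flags negative counts and extreme outliers.
    """
    anomalies = []
    for i, value in enumerate(series):
        if value < 0:
            anomalies.append(f'index {i}: negative ride count {value}')
        elif abs(value - baseline_mean) > max_sigma * baseline_std:
            anomalies.append(f'index {i}: value {value} beyond {max_sigma} sigma')
    return anomalies

issues = check_anomalies([12, 15, -3, 5000], baseline_mean=14.0, baseline_std=4.0)
# -> two messages, for index 2 (negative) and index 3 (outlier).
# In the pipeline step, `if issues: raise ValueError(...)` would fail the
# container and therefore stop the run.
```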

Modeling

Although modeling is the most interesting and exciting part of a data scientist’s job, modeling this particular problem was surprisingly easy, probably due to the well-behaved temporal recurrence of the time series throughout the hours of the day and days of the week.
The model consists of a Gated Recurrent Unit (GRU) followed by 2 Dense layers.
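To make the recurrence concrete, here is a single GRU step in plain numpy, followed by two dense layers on the final hidden state. The weights are random and the sizes are invented; this sketches the architecture’s shape, not the trained model:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, W, U, b):
    """One GRU time step. W: (3, input_dim, units), U: (3, units, units), b: (3, units)."""
    z = sigmoid(x @ W[0] + h @ U[0] + b[0])             # update gate
    r = sigmoid(x @ W[1] + h @ U[1] + b[1])             # reset gate
    h_cand = np.tanh(x @ W[2] + (r * h) @ U[2] + b[2])  # candidate state
    return (1 - z) * h + z * h_cand

rng = np.random.default_rng(0)
input_dim, units = 6, 16  # e.g. normalized count + calendar features
W = rng.normal(size=(3, input_dim, units)) * 0.1
U = rng.normal(size=(3, units, units)) * 0.1
b = np.zeros((3, units))

# Run the GRU over a window of 24 hourly feature vectors.
h = np.zeros(units)
for x in rng.normal(size=(24, input_dim)):
    h = gru_step(x, h, W, U, b)

# Two dense layers on top: ReLU hidden layer, then a linear output unit.
W1, b1 = rng.normal(size=(units, 8)) * 0.1, np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)) * 0.1, np.zeros(1)
prediction = np.maximum(h @ W1 + b1, 0) @ W2 + b2  # shape (1,)
```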

Deployment

Once trained, the model was deployed to Google Cloud Machine Learning Engine, a fully-managed service that exposes any model through a REST API and scales horizontally seamlessly. For instance, we managed to perform ~2,000 predictions per second.

Evaluation

The evaluation task comprises three steps: first, predictions are made on the test set and stored as CSV files in Cloud Storage.
Afterward, two parallel steps take place: metrics calculation (displayed in the Kubeflow UI) and prediction plotting (stored in Cloud Storage and displayed in the Kubeflow UI).
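The metrics step then reduces to reading the prediction CSVs and computing error metrics. MAE and RMSE below are an assumption for illustration; the repo defines the actual metrics and CSV layout:

```python
import csv
import io
import math

def forecast_metrics(y_true, y_pred):
    """MAE and RMSE over paired actual/predicted ride counts."""
    n = len(y_true)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)
    return {'mae': mae, 'rmse': rmse}

# Pretend this came from one of the evaluation CSVs in Cloud Storage.
sample_csv = "actual,predicted\n10,12\n20,18\n30,33\n"
rows = list(csv.DictReader(io.StringIO(sample_csv)))
y_true = [float(r['actual']) for r in rows]
y_pred = [float(r['predicted']) for r in rows]
metrics = forecast_metrics(y_true, y_pred)
# -> {'mae': 2.333..., 'rmse': 2.380...}
```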

Figure X: Artifact output from plot-timeseries step

Conclusion

I hope that by the end of this second post, you have an overview of the technical aspects of feature engineering, modeling, and deployment, and of how they are connected throughout the pipeline.
Again, if you want to understand and run our code, clone the git repo and have some fun.

CI&T

CI&T combines strategy, design and engineering expertise, working cross-functionally to deliver lasting impact to our clients.
