E2E Kubeflow Pipeline for time-series forecast — Part 2

Rodrigo Pereira
Dec 19, 2019 · 4 min read

This is the second and final post on how to build an end-to-end (e2e) pipeline for time-series forecasting using Kubeflow.
Feel free to read the first post if you haven’t yet.
It covers the motivation for solving such a problem, a bit of exploratory data analysis (EDA), and the technical aspects of Kubeflow.

This post presents an overview of each Kubeflow pipeline component designed for the proposed solution.
Feel free to take a look at the source code.

Pipeline components

Kubeflow provides a Python DSL in which each pipeline component is defined as a ContainerOp; in a nutshell, each component runs as a Docker container.
When you define a ContainerOp, you must provide:

  • a container image URL
  • an entry command
  • script arguments
  • file outputs (if any), used to do I/O across components
  • output artifact paths, used to export metrics or HTML widgets
Figure 1: Source code of pipeline definition.

Once the pipeline is defined, it can be compiled into a tar.gz file. When uploaded through the Kubeflow UI, the pipeline will look like this:

Figure 2: Designed pipeline for time-series forecast

To run an experiment, you just need to click the Create Run button and enter the pipeline parameters like this:

Figure 3: Kubeflow experiment UI

Now let’s take an overview of each pipeline step.

Preprocessing

The preprocessing components are designed primarily to process the input data stored in BigQuery into TFRecord shards, which are consumed later at the training step.
The preprocessing part of the pipeline is split into the following steps: read-metadata, bq2tfrecord, and data-validation.

The first preprocessing step is responsible for listing both the eligible community areas and the z-norm statistics (mean and standard deviation) for each community. These outputs are fed into the bq2tfrecord step.

This step is built on Apache Beam and can currently run on either the DirectRunner or the DataflowRunner.

The second preprocessing step reads all records from BigQuery and outputs one tf.Example per window of data.

The raw data contains individual taxi ride records, so they need to be aggregated hourly for each community. This aggregation is done in BigQuery, since it is optimized for such a task. The sliding windowing is done in the Apache Beam pipeline, which outputs a time series for each community containing the number of taxi rides for a given N-hour window.
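The sliding-window step can be sketched in plain Python: given one community’s hourly counts, it emits one training pair per window position. Window size and the one-step horizon below are illustrative choices, not the repo’s actual parameters:

```python
def sliding_windows(hourly_counts, window_size, horizon=1):
    """Yield (input_window, target) pairs from an hourly time series.

    hourly_counts: list of ride counts, one per hour, for one community.
    window_size:   number of past hours fed to the model.
    horizon:       how many hours ahead the target lies.
    """
    for start in range(len(hourly_counts) - window_size - horizon + 1):
        window = hourly_counts[start:start + window_size]
        target = hourly_counts[start + window_size + horizon - 1]
        yield window, target

# Example: 6 hourly counts, 3-hour windows, predict the next hour.
pairs = list(sliding_windows([5, 8, 13, 9, 4, 7], window_size=3))
# -> [([5, 8, 13], 9), ([8, 13, 9], 4), ([13, 9, 4], 7)]
```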

Furthermore, TensorFlow Transform is used to apply the z-norm transformation to each community’s time series, both to transform the training data and to save the normalization parameters as lookup tables to be used later at serving time.
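The z-norm itself is simple: each community’s series is shifted by its mean and scaled by its standard deviation, and those two constants are exactly what must be kept around for serving. A plain-Python equivalent of the arithmetic, with made-up numbers:

```python
import math

def znorm(series):
    """Return the z-normalized series plus the (mean, std) used, which
    must be saved so the same transform can be applied at serving time."""
    mean = sum(series) / len(series)
    var = sum((x - mean) ** 2 for x in series) / len(series)
    std = math.sqrt(var)
    normalized = [(x - mean) / std for x in series]
    return normalized, mean, std

normalized, mean, std = znorm([10.0, 20.0, 30.0])
# mean = 20.0, std ~ 8.165; normalized ~ [-1.2247, 0.0, 1.2247]
```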

The community area code is also stored, to be used for embeddings during training. Additionally, along with the number of taxi rides, each time step also stores the hour of the day (0–23), day of the week (0–6), week of the year (0–54), month (0–11), and day of the year (0–365). This way we explicitly provide some contextual information to the model.
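Extracting these calendar features for a single timestamp can be sketched with the standard library, with offsets chosen to match the 0-based ranges above:

```python
from datetime import datetime

def calendar_features(ts: datetime) -> dict:
    """Contextual features for one time step, all 0-based as described above."""
    return {
        'hour_of_day': ts.hour,                     # 0-23
        'day_of_week': ts.weekday(),                # 0-6, Monday = 0
        'week_of_year': ts.isocalendar()[1] - 1,    # 0-based ISO week number
        'month': ts.month - 1,                      # 0-11
        'day_of_year': ts.timetuple().tm_yday - 1,  # 0-365
    }

features = calendar_features(datetime(2019, 12, 19, 14))
# -> {'hour_of_day': 14, 'day_of_week': 3, 'week_of_year': 50,
#     'month': 11, 'day_of_year': 352}
```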

Figure 4: Normalized number of rides for communities #6, #7 and #8

Data Validation

In order to check for anomalies and perform a fast exploratory data analysis, TensorFlow Data Validation (TFDV) was used.

It consumes the TFRecord dataset output by the previous steps to compute descriptive statistics. You can perform the same analysis on the raw data; you just need to output it along with the preprocessed data.

Moreover, it is up to you to code the pipeline to stop if any anomaly occurs, but that was not done here.
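As a minimal stand-in for TFDV’s schema-based checks, such a gate could compare fresh data against the training baseline and fail the step on a mismatch. The checks and thresholds below are arbitrary assumptions, not what TFDV actually validates:

```python
def check_anomalies(series, baseline_mean, baseline_std, max_sigma=6.0):
    """Return a list of anomaly messages; an empty list means the data passes.

    A real pipeline would validate statistics against a TFDV schema; this
    sketch only flags negative counts and extreme outliers.
    """
    anomalies = []
    for i, value in enumerate(series):
        if value < 0:
            anomalies.append(f'index {i}: negative ride count {value}')
        elif abs(value - baseline_mean) > max_sigma * baseline_std:
            anomalies.append(f'index {i}: value {value} beyond {max_sigma} sigma')
    return anomalies

issues = check_anomalies([12, 15, -3, 5000], baseline_mean=14.0, baseline_std=4.0)
# -> two messages, for index 2 (negative) and index 3 (outlier).
# In the pipeline step, `if issues: raise ValueError(...)` would fail the
# container and therefore stop the run.
```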

Modeling

Although modeling is the most interesting and exciting part of a data scientist’s job, modeling this particular problem was surprisingly easy, probably due to the well-behaved temporal recurrence of the time series throughout the hours of the day and days of the week.
The model consists of a Gated Recurrent Unit (GRU) followed by 2 Dense layers.
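To make the recurrence concrete, here is a single GRU step in plain numpy, followed by two dense layers on the final hidden state. The weights are random and the sizes are invented; this sketches the architecture’s shape, not the trained model:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, W, U, b):
    """One GRU time step. W: (3, input_dim, units), U: (3, units, units), b: (3, units)."""
    z = sigmoid(x @ W[0] + h @ U[0] + b[0])             # update gate
    r = sigmoid(x @ W[1] + h @ U[1] + b[1])             # reset gate
    h_cand = np.tanh(x @ W[2] + (r * h) @ U[2] + b[2])  # candidate state
    return (1 - z) * h + z * h_cand

rng = np.random.default_rng(0)
input_dim, units = 6, 16  # e.g. normalized count + calendar features
W = rng.normal(size=(3, input_dim, units)) * 0.1
U = rng.normal(size=(3, units, units)) * 0.1
b = np.zeros((3, units))

# Run the GRU over a window of 24 hourly feature vectors.
h = np.zeros(units)
for x in rng.normal(size=(24, input_dim)):
    h = gru_step(x, h, W, U, b)

# Two dense layers on top: ReLU hidden layer, then a linear output unit.
W1, b1 = rng.normal(size=(units, 8)) * 0.1, np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)) * 0.1, np.zeros(1)
prediction = np.maximum(h @ W1 + b1, 0) @ W2 + b2  # shape (1,)
```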

Deployment

Once trained, the model was deployed to Google Cloud Machine Learning Engine, a fully-managed service that exposes any model through a REST API and scales horizontally seamlessly. For instance, we managed to perform ~2,000 predictions per second.

Evaluation

The evaluation task comprises three steps: first, predictions are made on the test set and stored as CSV files in Cloud Storage.
Afterward, two parallel steps take place: metrics calculation (displayed in the Kubeflow UI) and prediction plotting (stored in Cloud Storage and displayed in the Kubeflow UI).
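The metrics step then reduces to reading the prediction CSVs and computing error metrics. MAE and RMSE below are an assumption for illustration; the repo defines the actual metrics and CSV layout:

```python
import csv
import io
import math

def forecast_metrics(y_true, y_pred):
    """MAE and RMSE over paired actual/predicted ride counts."""
    n = len(y_true)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)
    return {'mae': mae, 'rmse': rmse}

# Pretend this came from one of the evaluation CSVs in Cloud Storage.
sample_csv = "actual,predicted\n10,12\n20,18\n30,33\n"
rows = list(csv.DictReader(io.StringIO(sample_csv)))
y_true = [float(r['actual']) for r in rows]
y_pred = [float(r['predicted']) for r in rows]
metrics = forecast_metrics(y_true, y_pred)
# -> {'mae': 2.333..., 'rmse': 2.380...}
```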

Figure X: Artifact output from plot-timeseries step

Conclusion

I hope that by the end of this second post, you have an overview of the technical aspects of feature engineering, modeling, and deployment, and of how they are connected throughout the pipeline.
Again, if you want to understand and run our code, clone the git repo and have some fun.

CI&T

CI&T combines strategy, design and engineering expertise, working cross-functionally to deliver lasting impact to our clients.
