E2E Kubeflow Pipeline for time-series forecast — Part 1

Rodrigo Pereira
CI&T
Oct 24, 2019 · 8 min read

Motivation and use case

This post aims to provide tips and tricks on how to deploy an end-to-end Machine Learning pipeline with Kubeflow on GKE for real-time forecasting.

As a working example, we tackle the problem of forecasting the number of taxi rides in the city of Chicago.

With such a solution, taxi companies could better allocate their fleets to reduce passenger waiting time and increase driver productivity.

To check the source code, clone this repository.

About the Data

The Department of Business Affairs & Consumer Protection (BACP) regulates and oversees licensing, vehicle inspections and fare rates for public passenger vehicle services in the city of Chicago, such as taxicabs, pedicabs, charter buses and public chauffeurs.

BACP authorizes two payment processors to collect information about taxi rides. Below is the list of the provided data:

  • Taxi cab ID (which taxi provided the trip)
  • Taxi company
  • Trip start and end time
  • Trip distance
  • Starting and ending community area
  • Latitude and longitude pickup and drop-off coordinates
  • Fare amount, tips, tolls and extra charges
  • Payment type (e.g. credit card or cash)

Due to privacy concerns, some of the aforementioned data are masked.

First, taxi trips are not reported in real time: it takes a few days for the taxi rides data to be pre-processed and appear in the dataset.

Also, the trip start and end times are rounded to the nearest 15 minutes. For instance, a trip that starts at 9:26 am is saved as starting at 9:30 am.
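This rounding is easy to reproduce when preparing your own data; a minimal sketch using only the standard library (the function name is illustrative, not from the dataset's tooling):

```python
# Round a timestamp to the nearest 15-minute mark, mimicking the
# anonymization applied to the trip start/end times in the dataset.
from datetime import datetime, timedelta

def round_to_quarter_hour(ts: datetime) -> datetime:
    # time elapsed since the most recent 15-minute boundary
    discard = timedelta(minutes=ts.minute % 15,
                        seconds=ts.second,
                        microseconds=ts.microsecond)
    ts -= discard
    if discard >= timedelta(minutes=7, seconds=30):
        ts += timedelta(minutes=15)  # round up past the halfway point
    return ts

print(round_to_quarter_hour(datetime(2019, 4, 1, 9, 26)))  # 2019-04-01 09:30:00
```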

Moreover, instead of the license plate number, the taxi cab ID is a fake ID created specifically for this dataset.

Finally, the latitude and longitude coordinates are also anonymized, since combined with the exact trip start and end times they could compromise someone’s privacy. Thus, the latitude and longitude coordinates are not the exact pickup and drop-off points, but the center of the community area.

For more information, please read the official dataset description.

Fortunately, these privacy-masking procedures are not a concern for the proposed modeling, because only the hourly taxi-rides series for each community area is taken into account to perform the forecast.

Figure 1: Map of Chicago community areas. Source: link

Exploratory Data Analysis

With the widespread dissemination of peer-to-peer ride-sharing apps such as Uber, Cabify and Lyft, the usage of taxi cabs has been decreasing. From a time series modeling perspective, this is problematic because the series is no longer stationary. Figure 2 shows a downtrend in taxi rides per month from 2013 to the present.

Classical approaches such as autoregressive models cannot achieve satisfactory performance on such data, so some preprocessing needs to be done in order to remove any trend. Usually, first-order (or, if necessary, higher-order) differencing solves the problem of non-stationarity.
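To see why differencing helps, here is a small sketch on a synthetic series (the slope and seasonality values are illustrative, not fitted to the taxi data):

```python
# First-order differencing removes a linear trend: each difference
# y[t] - y[t-1] absorbs the trend as a constant, leaving a series
# that fluctuates around a fixed mean instead of drifting.
import numpy as np

t = np.arange(100)
series = 0.5 * t + np.sin(2 * np.pi * t / 24)  # trend + seasonality stand-in

diffed = np.diff(series)  # y[t] - y[t-1]

# the trend contributes a constant 0.5 to every difference
print(round(diffed.mean(), 3))  # close to the slope, 0.5
```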

However, recent advances in Deep Learning show that Recurrent Neural Networks manage to deal with this property seamlessly.

Figure 2: Monthly total number of taxi rides between 2013 and 2019

Since the goal is to forecast the number of taxi rides for each community area in the following hour, a first assumption one may make is that the hour of the day and the community area itself strongly influence the distribution of the data.

Areas surrounding the city center may have a higher volume of trips than peripheral ones, and afternoons may concentrate a higher volume than nights, for instance.

To check this, let’s take a look at the distribution of rides among communities #6, #7 and #8 between 04/01/2019 and 04/10/2019:

Figure 3: Number of rides for communities #6, #7 and #8

Moreover, the heat map in Figure 4 gives us a summary of how the taxi rides are distributed.

Figure 4: Heat map of the number of taxi rides across the city of Chicago between 04/01/2019 and 04/10/2019

It can easily be seen that the order of magnitude of rides differs a lot across areas, to the point that only the temporal behavior of community #8 can be easily perceived. To overcome this issue, let’s normalize each time series via z-normalization. Figure 5 shows the transformed time series.

Figure 5: Normalized number of rides for communities #6, #7 and #8

Now the time series lie in the same order of magnitude and the temporal behavior can be noticed in all of them. The minimum number of rides usually happens between 1 am and 3 am, and the peaks happen between 8 am and 9 am for communities #6 and #7, and between 5 pm and 6 pm for community #8.

With such normalization, a machine learning model becomes agnostic to the order of magnitude of the input data. This is highly important because the model will be fed time series of similar shape regardless of the community area, which makes the model easier to generalize.

In order to deploy a model with such a transformation, it is indispensable to keep track of the mean and standard deviation to transform new incoming data. This is done with TensorFlow Transform and a TensorFlow lookup table; the details will be discussed further on.
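As a rough illustration of that bookkeeping, stripped of the TensorFlow machinery: the per-area statistics are fit once on training data, stored in a lookup table, and reused to transform anything that arrives at serving time. The area names and numbers below are hypothetical.

```python
# Fit z-normalization parameters per community area on training data
# and reuse them for new data, mirroring what the post does with
# TensorFlow Transform and a TF lookup table.
import numpy as np

def fit_z_params(train_series: dict) -> dict:
    """Return a lookup table: area -> (mean, std) from training data."""
    return {area: (np.mean(s), np.std(s)) for area, s in train_series.items()}

def z_normalize(area: str, values, params: dict):
    mean, std = params[area]  # reuse the *training* statistics
    return (np.asarray(values, dtype=float) - mean) / std

train = {'area_8': [120, 140, 160, 180], 'area_6': [10, 12, 14, 16]}
params = fit_z_params(train)
print(z_normalize('area_8', [150], params))  # 150 is the training mean -> 0.0
```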

Although the time series itself may be enough for forecast modeling, other variables may also influence the number of rides: day of the week, day of the month, the month, holidays, weather, whether a natural disaster occurred, and so on.
However, the current solution does not use any external data source; only temporal features (e.g. day of the week and day of the month) were taken into account for feature engineering.
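Those temporal features can be derived directly from each timestamp; a minimal sketch using only the standard library (the feature names are illustrative, not the post's actual schema):

```python
# Derive simple temporal features from a timestamp.
from datetime import datetime

def temporal_features(ts: datetime) -> dict:
    return {
        'hour': ts.hour,               # hour of the day (0-23)
        'day_of_week': ts.weekday(),   # Monday = 0 ... Sunday = 6
        'day_of_month': ts.day,
        'month': ts.month,
        'is_weekend': ts.weekday() >= 5,
    }

print(temporal_features(datetime(2019, 4, 6, 17)))  # a Saturday afternoon
```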

Machine Learning Pipeline

Kubeflow

When talking about applied machine learning, a common pitfall is to think that modeling will consume most of a data scientist’s time. However, there are other concerns, usually forgotten, such as model deployment and monitoring, scaling infrastructure and data validation, that in practice are as important as the model itself.

Usually, such concerns are only perceived after a data scientist has done some work locally on a laptop with sample data and is then asked to scale every step of the ML flow.

Figure 6 illustrates a comparison between expectation and reality for the effort spent on each machine learning step.

Figure 6: Effort comparison between expectation and reality. Source: Sculley et al., Hidden Technical Debt in Machine Learning Systems

Given such a scenario, Kubeflow comes in handy to orchestrate the aforementioned aspects.

Kubeflow is a platform agnostic machine learning toolkit built on Kubernetes designed to orchestrate all stages of a ML workflow: data exploration & preparation, model training, serving & monitoring, and experimentation.

It makes it easy to scale and reproduce the entire workflow, from a laptop to a cluster deployed on a cloud provider.

Rather than copying and pasting the entire official Kubeflow documentation into this post, which is quite complete, feel free to read the end-to-end tutorial on GCP, which was the basis for this work. Also, take a look at the source code built for this post.

Kubeflow has plenty of components that help you in your experimentation cycle, such as: Metadata (artifact management), Katib (hyperparameter tuning), Jupyter Notebooks, TFJobs (training), Seldon Serving (model deployment, A/B testing), NVIDIA TensorRT (optimized serving on both GPU and CPU) and Pipelines (workflow orchestration).

Pipelines

Kubeflow Pipelines is a component that allows you to orchestrate your workflow by packaging each pipeline step as a Docker container; the containers then communicate with each other.
The advantage of this approach is that once you have a dockerized step, you can reuse it in any other pipeline in a plug-and-play fashion.

Figure 7 shows the designed pipeline for the time series forecast problem.

Figure 7: Time series forecast pipeline

After you have dockerized all of your pipeline’s steps, you must plug them together as ContainerOp objects. A ContainerOp holds the URL of the Docker image to be pulled (e.g. from Docker Hub or Container Registry), the command to be run and its arguments. Read our example to check how this is done.
Don’t worry, more technical details will be tackled in the second post.

TensorFlow Extended

According to the TFX official web page: “A TFX pipeline is a sequence of components that implement an ML pipeline which is specifically designed for scalable, high-performance machine learning tasks. That includes modeling, training, serving inference, and managing deployments to online, native mobile, and JavaScript targets.”

One has two options to use the TFX components:

  • Using its pre-built pipeline, in which the developer only has to write a few pieces of code in a fill-in-the-blanks style. Under the hood, TFX builds a pipeline to be run on one of its orchestrators: Apache Airflow, Apache Beam or Kubeflow Pipelines. With Kubeflow Pipelines, each component is encapsulated into a ContainerOp and, in the end, the compiled pipeline is output as a .tar.gz file ready to be uploaded to the Kubeflow cluster.
    In this approach, every step is built from the same container image, so the preprocessing container, for instance, will have all the training dependencies installed (e.g. CUDA and cuDNN) even though they are not necessarily used.
  • Using each component as a standalone library. This was the approach used in this work. Each pipeline step, regardless of whether it uses a TFX component or not, was encapsulated into a ContainerOp with its own container image and the appropriate dependencies installed. The pipeline must be manually defined using the Kubeflow Pipelines Domain-Specific Language; afterward, the compiled pipeline is output as a .tar.gz file ready to be uploaded to the Kubeflow cluster.

As a result, there is a trade-off between usability and flexibility.
Below is a brief description of the main TFX components:

TensorFlow Transform (TFT)

TensorFlow Transform allows the definition of preprocessing procedures to be used both in data transformation steps and in predictions on new (and raw) incoming data for already-deployed models. The artifacts output by TFT are the transformed data used to train a model and a piece of the TensorFlow graph, attached to the exported saved model, containing the previously defined transformations.

TensorFlow Data Validation (TFDV)

TensorFlow Data Validation analyses a given dataset to compute descriptive statistics, detect anomalies and infer a schema. This allows the data science team to automatically detect any inconsistency within the training data or between the training data and new incoming data.

TensorFlow Model Analysis (TFMA)

TensorFlow Model Analysis aims to offer detailed model evaluation metrics stratified by features (or slices). This can be useful in a time series problem to check whether the model behaves differently between weekends and weekdays, or across different months or hours of the day, for example. To do so, one has to export a different kind of saved model, called an EvalSavedModel. Finally, the model can be evaluated and the metrics exported to be visualized in a Jupyter Notebook or in the Kubeflow UI.

Conclusion

I hope that by the end of this post you have an overview of the problem to be solved, the tools, the data, the feature engineering approach and the motivation to scale your ML workflow with Kubeflow from the very beginning of your project.
The next post will discuss technical details, with tips and tricks for each pipeline step, and show how a simple RNN model manages to solve the problem reasonably well. See you there.
