To provide the best user experience, Dailymotion relies heavily on machine learning algorithms and AI, which power its recommendation and video tagging engines. Being able to smoothly build, train and deploy new models is therefore one of our main concerns as data engineers. Let’s see how Apache Airflow and Google Kubernetes Engine work together to help us achieve that goal.
The complex life cycle of a machine learning model
To get a sense of the challenges of bringing a model from exploration to production, let’s briefly recall the different steps in the life cycle of a machine learning model.
As shown in the above chart, different people are involved (data scientists and data engineers), as well as different environments, each with its own requirements and purposes.
Data scientists are more focused on model performance and will iterate over the process in the development environment, whereas data engineers are more concerned with workflow automation and reliability in production. Depending on the organisation, either data scientists or data engineers schedule the pipelines, but data engineers are the ones in charge of providing the framework to do so. While designing this framework, we wanted to ensure that data scientists would have the freedom to use whatever tools they want and the computation resources they need.
Overall, the different stages remain almost the same in exploration and production, so we can reuse the code written in exploration when moving to production, which avoids discrepancies and saves time. At Dailymotion, we mainly use Google Cloud Platform to store our data and to build services on top of it. Apache Airflow is our main scheduler and we run it in Google Kubernetes Engine (GKE). We will see how both helped us meet those requirements.
In this article, we are going to focus on the technical part of the framework. If you are interested in a full overview of the collaboration between our data engineers, analysts and scientists, you can check out Germain Tanguy’s article:
Collaboration between data engineers, data analysts and data scientists
How to efficiently release in production?
To better understand how Airflow and GKE help us achieve our objective, let’s first look at the solutions we went through before arriving at the current one.
Our first tries in scheduling ML pipelines
At the very beginning, the team was pretty small, with very few data engineers. For one of our first ML models, the data extraction and preprocessing parts, which consisted of BigQuery jobs, were easily moved to production and scheduled with Airflow using BigQueryOperator. However, when it came to model training, things were a little trickier, as it required specific libraries and machines. Due to a lack of engineering resources and short deadlines, the data scientist had to bring this part to production on his own, without a well-defined framework. For convenience, the required Docker image was built and run in GKE. The model training was then scheduled by a cron-like daemon set up in the same pod as the training application. This worked fairly well: Docker images are environment agnostic and allow the use of different libraries, while GKE provides the needed computation resources. The solution seemed good enough, but in fact it had several drawbacks:
- The GKE cluster was always up, yet it sat idle most of the time when, for example, models were trained only once a day.
- We were missing the features expected of a good scheduler, such as retries on failure, backfill capabilities and more complex trigger rules.
Later on, the team grew a lot and we had more and more jobs scheduled in Airflow, so we naturally decided to move all of our ML tasks there as well. At that time, Airflow’s KubernetesPodOperator was not available yet, but DockerOperator was, so we relied on it to run our jobs on Google Compute Engine (GCE). The workflow can be described as follows:
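As a rough illustration, the life cycle of one job in this setup boils down to three steps: create a VM, run the Docker image on it, delete the VM. In the sketch below, `create_instance`, `run_container` and `delete_instance` are hypothetical stand-ins for the custom Airflow operators and for DockerOperator. They only record what would happen, and the instance name, machine type and image are made up. Cleaning up even when the job fails is an assumption about how such a workflow is usually wired.

```python
from contextlib import contextmanager

# Hypothetical stand-ins for the custom operators: they only log the
# steps instead of calling the Compute Engine API.
events = []

def create_instance(name, machine_type):
    events.append(f"create {name} ({machine_type})")
    return name

def delete_instance(name):
    events.append(f"delete {name}")

def run_container(instance, image):
    events.append(f"run {image} on {instance}")

@contextmanager
def temporary_instance(name, machine_type):
    """Create a VM for a single job and always delete it afterwards."""
    instance = create_instance(name, machine_type)
    try:
        yield instance
    finally:
        # Assumption: cleanup runs even if the job fails, so that no VM
        # is left running (and billed) by accident.
        delete_instance(instance)

def run_training(image):
    # One dedicated instance per Docker image.
    with temporary_instance("ml-training-vm", "n1-highmem-8") as vm:
        run_container(vm, image)

run_training("gcr.io/my-project/train:latest")
print(events)
```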
Still, we were not fully satisfied with this solution because:
- We had to explicitly create a VM instance and delete it afterwards through custom Airflow operators.
- There was not much reusability: access scopes were redefined for each instance and machine resources were not shared (one instance per Docker image).
- We lost the ability to run a small cluster and share services such as a metrics collector agent (Datadog, for example).
When KubernetesPodOperator was released a few weeks later, we were excited to try it out, as we believed it would help us fully leverage Kubernetes capabilities.
Where Airflow and Kubernetes make the difference
The idea was to make use of GKE node pools: each pool provides a type of machine (high memory, high CPU, GPU…) and a set of scopes, with auto-scaling enabled. We can then submit our ML jobs to the different pools from Airflow using KubernetesPodOperator:
Running our ML tasks in Kubernetes with node pools has several advantages:
- We don’t need to manually manage instance creation and destruction: if no pod is running on a pool, it automatically scales down to zero, and if more resources are needed, more nodes are created.
- Computation resources can be shared. For example, we can have a pod running on a node for a training job. If the node has enough resources, another pod (most likely a lightweight job) can be scheduled on it, avoiding the extra cost (time and money) of creating another instance.
- Services can be shared across all the pods in the cluster. For example we can set up a single Datadog agent service to collect the metrics of our different ML jobs.
Node pools creation
We decided to create our node pools on the same cluster as our Airflow instance. This makes things a little easier, since we already have some services there which can be reused. Creating a node pool is fairly easy with GKE:
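As an illustration, a pool of this kind can be created with `gcloud container node-pools create`. The sketch below assembles such a command in Python so it can be reviewed before being run. The pool name, cluster name, machine type, scopes and the `dedicated` taint key are illustrative assumptions, not our actual values.

```python
import shlex

def node_pool_create_command(pool, cluster, machine_type, taint_value,
                             max_nodes=3, scopes=("bigquery", "storage-rw")):
    """Build a `gcloud container node-pools create` command for an ML pool.

    All values passed in are illustrative; adapt them to your project.
    """
    return [
        "gcloud", "container", "node-pools", "create", pool,
        f"--cluster={cluster}",
        f"--machine-type={machine_type}",
        # Autoscaling with a minimum of zero lets an idle pool shrink to nothing.
        "--enable-autoscaling",
        "--min-nodes=0",
        f"--max-nodes={max_nodes}",
        # Access scopes granted to the nodes of this pool.
        f"--scopes={','.join(scopes)}",
        # Taint the nodes so only pods with a matching toleration land here.
        # "dedicated" is a hypothetical taint key.
        f"--node-taints=dedicated={taint_value}:NoSchedule",
    ]

cmd = node_pool_create_command(
    pool="ml-high-mem",
    cluster="airflow-cluster",
    machine_type="n1-highmem-16",
    taint_value="ml-high-mem",
)
print(shlex.join(cmd))
# To actually create the pool: subprocess.run(cmd, check=True)
```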
The command is pretty straightforward. One important thing is that we are setting --node-taints on the nodes to prevent any pod from running on a node pool inadvertently. Only pods having the associated toleration can get scheduled on it. Indeed, it would be a waste to have a pod requiring very few resources trigger a scale-up of a node pool offering machines with GPUs.
Once the node pool is created, we can use Airflow’s KubernetesPodOperator as follows:
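As a hedged sketch (not our exact code), the arguments for such a task can be assembled as below. It targets the Airflow 1.10 `KubernetesPodOperator` from `airflow.contrib`; newer releases rename `xcom_push` to `do_xcom_push`. The pool name, image, namespace, taint key, the `AIRFLOW_ENV` variable and the context name are illustrative assumptions.

```python
import os

# Hypothetical environment switch: "prod" means Airflow runs inside the
# same GKE cluster as the node pools; anything else is a local test.
IN_CLUSTER = os.environ.get("AIRFLOW_ENV", "local") == "prod"
# Only needed outside the cluster: the kubectl context of the dev cluster.
CLUSTER_CONTEXT = None if IN_CLUSTER else "dev-cluster-context"

def ml_pod_kwargs(task_id, image, pool, arguments=None):
    """Keyword arguments for a KubernetesPodOperator bound to one node pool."""
    return {
        "task_id": task_id,
        "name": task_id.replace("_", "-"),  # pod names cannot contain "_"
        "namespace": "default",
        "image": image,
        "arguments": arguments or [],
        # GKE labels every node with the name of its pool.
        "node_selectors": {"cloud.google.com/gke-nodepool": pool},
        # Must match the taint set on the pool ("dedicated" is hypothetical).
        "tolerations": [{
            "key": "dedicated",
            "operator": "Equal",
            "value": pool,
            "effect": "NoSchedule",
        }],
        "in_cluster": IN_CLUSTER,
        "cluster_context": CLUSTER_CONTEXT,
        "xcom_push": True,
        "get_logs": True,
    }

kwargs = ml_pod_kwargs(
    task_id="train_model",
    image="gcr.io/my-project/train:latest",
    pool="ml-high-mem",
    arguments=["--date", "{{ ds }}"],
)
# Inside a DAG file (Airflow 1.10):
# from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator
# train = KubernetesPodOperator(dag=dag, **kwargs)
```

Wrapping the arguments in a small helper like this is one way to hide pool details from the people writing DAGs: they only pick a pool name and an image.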
Again, it is pretty straightforward, but let’s go through some interesting parameters:
- node_selectors: specifies which nodes the pod should run on. Here we target the nodes of a specific pool.
- in_cluster: tells whether the target cluster is the same as the one where Airflow is running. Here we use a dynamic parameter, IN_CLUSTER, which takes different values depending on the environment where Airflow runs. For local tests, we deploy the pods in our Airflow dev cluster, so the value will be False and CLUSTER_CONTEXT will be equal to our dev cluster context.
- tolerations: sets the required tolerations. Here we are providing the toleration associated with the node pool taint.
- xcom_push: enables the pod to send a result back to the Airflow worker. We need to write the data we want to return to a JSON file located at /airflow/xcom/return.json. Under the hood, KubernetesPodOperator mounts a volume and uses a sidecar container which reads the file and prints it to stdout, where it is captured and parsed by the worker.
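On the container side, returning a value only requires writing that file before the process exits. A minimal sketch, assuming a Python training script; the helper name and the metric values are made up for illustration:

```python
import json
import os

# Path read by the XCom sidecar container.
XCOM_RETURN_PATH = "/airflow/xcom/return.json"

def write_xcom_result(result, path=XCOM_RETURN_PATH):
    """Serialize `result` as JSON where the XCom sidecar expects it."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        json.dump(result, f)

# At the very end of the training script (hypothetical metrics):
# write_xcom_result({"model_version": "v1", "auc": 0.91})
```

A downstream task can then read the value with `xcom_pull`, using the task id of the pod task.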
As data engineers, we just needed to set up the pools and create a few variables and methods to add a layer of abstraction. Our data scientists can now very easily create their own DAGs leveraging the node pools. At the end of the day, Airflow’s KubernetesPodOperator enables us to take full advantage of Kubernetes features such as node auto-scaling and resource and service sharing, while scheduling our ML tasks in Airflow.
In our journey to provide the most efficient framework to build and serve ML models, there are still many things we want to take a look at:
- KubernetesExecutor: you may wonder why we did not move to KubernetesExecutor instead of using KubernetesPodOperator. I believe the former serves a different purpose: it is meant for cases where Airflow workers do heavy work and need to scale up. This was not really our case at the time, as our workers mainly submit jobs to be run on Google BigQuery, Dataflow… More importantly, moving to KubernetesExecutor requires more work than just using a new operator. However, we might still consider it in the future as it offers many advantages: for example, running lots of backfills puts stress on our workers, and KubernetesExecutor might be a good solution.
- Kubeflow: it offers a whole platform for ML workflow from exploration to serving and aims to make it simpler and more consistent. Our current framework does not fully cover the serving part, so we definitely want to check it out.
Apache Airflow plays very well with Kubernetes when it comes to scheduling jobs on a Kubernetes cluster. KubernetesPodOperator provides a set of features which make things much easier. In our case, it enables us to deliver ML models into production faster and ultimately to enhance the user experience with more data-oriented features. Of course, we could achieve the same results by taking our first solutions and adding the missing features ourselves, but at what cost? Here we have a neat built-in solution which perfectly answers our needs, so let’s make use of it.