Migrating our machine learning platform from AWS Sagemaker to Kubernetes & Kubeflow

Sep 9, 2020 · 7 min read

By Jérémy Smadja (DevOps/SRE)

An “expected” journey…

The first recommendation infrastructure was built around three tools:

  • Apache Airflow is a well-known open-source tool for scheduling and managing workflows.
  • AWS EMR is a managed service that provisions batch/stream processing clusters for large amounts of data.
  • AWS Sagemaker is a managed service whose aim is to put your machine learning code into production. It handles all the infrastructure to train your model, as well as the serving (automatically updating the endpoint with the new model, without downtime).
First recommendation system with Sagemaker

To put it simply, Airflow is responsible for periodically triggering tasks such as preparing a new dataset using AWS EMR, launching a new model training on Sagemaker, and asking Sagemaker to expose the newly generated model online (serving).

Airflow workflow managing the entire pipeline from a new dataset to the inference endpoint deployment

Seems perfect, but…

  • To handle things automatically, you have to respect the documented format required by Sagemaker. This is not always convenient when you want to deeply change the training process, and you also need to pack all your processes into one container (in our case the HTTP server, the reverse proxy, the metrics agent…).
  • In 2019, the generated model had to be smaller than 30 GB to fit the available “ml” instance types serving your model (nowadays AWS offers many more instance types).
  • “ml” instances are way more expensive than the usual EC2 instance types (>30%).

Additionally, we want to help our data scientists be more effective by allowing more reproducibility, especially between their development environment and the production one: same code, same container, same power, same size…

Here comes a new challenger!

Kubeflow is presented as a big toolbox aggregating multiple pieces of software (open-source projects or self-developed ones) to facilitate the work of data scientists:

  • Pipelines using Argo Workflows, where each step is a “component”, i.e. a pod/container running in your Kubernetes cluster.
  • A Python SDK allowing you, for example, to transform any function into a component to be used in a pipeline.
  • Hyperparameter tuning using Katib.
  • Serving with KFServing, Seldon, or TensorFlow Serving.
  • A lot of integrations for common machine learning frameworks: TensorFlow, PyTorch, MPI…
  • Metadata storage and visualization.
  • A Jupyter notebook server.

Well, it all seems good, but as explained earlier, it runs on Kubernetes (AWS EKS in our case), so don’t think you can easily jump into Kubeflow, create a pipeline, and expose your model just like that.

There are a lot of infrastructure considerations under the hood, and consequently you will likely need an admin/DevOps/data engineer to make things work. If you don’t have that resource, maybe you should consider using a managed service like AWS Sagemaker or AWS Personalize.

The main pipeline

Kubeflow main pipeline

The prepare part uses three components: create_emr_cluster, submit_emr_job, and delete_emr_cluster. Something great about the component concept is that components can easily be reused in any pipeline, meaning our data scientists can create an EMR cluster without knowing everything about its configuration, and it’s the same configuration as in the production environment. Even better, we didn’t have to create these components from scratch, because the community had already done it (Github).

As the training process was already a container used by Sagemaker, we only had to clean out some Sagemaker-specific code and handle the S3 download and upload of the model ourselves. Next, we had to create a Kubeflow component YAML file (the interface between the container and the Python SDK operator) to include the train step in our pipeline.
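As a rough sketch, such a component file can look like the following (the image URI, parameter names, and paths here are hypothetical, not our actual configuration):

```yaml
# Hypothetical Kubeflow component definition for the train step.
name: Train recommendation model
description: Trains the model from a prepared dataset and saves it to S3.
inputs:
  - {name: dataset_s3_path, type: String}
outputs:
  - {name: model_s3_path, type: String}
implementation:
  container:
    # Illustrative image URI; use your own ECR repository.
    image: <account>.dkr.ecr.eu-west-1.amazonaws.com/reco-train:latest
    command: [python, train.py]
    args:
      - --dataset
      - {inputValue: dataset_s3_path}
      - --model-output
      - {outputPath: model_s3_path}
```

The Python SDK can then load it with kfp.components.load_component_from_file() and use it as a step in any pipeline.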

The prepare step runs our code outside of our Kubernetes cluster, on AWS EMR. So the K8S workers that run your components (pods) must have the rights to manipulate AWS EMR (create, submit, delete).

The train component is a pod running inside our Kubernetes cluster, using one specific, powerful, and pricey instance type. So you must ensure your cluster creates these instances only when necessary, schedules your pods on these workers (and not on the workers hosting the Kubeflow pods), and scales the instances down when the job is done (hello my dear Tech friend 🙂).
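One common way to achieve this (a sketch; the label, taint, and resource values below are made up for illustration) is to dedicate a node group to training via a nodeSelector and a matching taint/toleration pair, and let the cluster-autoscaler scale that group from and back to zero:

```yaml
# Fragment of the train pod spec: pin it to the dedicated node group
# and keep other pods off those pricey workers via a matching taint.
spec:
  nodeSelector:
    workload: training          # hypothetical node group label
  tolerations:
    - key: dedicated            # hypothetical taint set on the training ASG
      operator: Equal
      value: training
      effect: NoSchedule
  containers:
    - name: train
      resources:
        requests:               # illustrative sizing
          cpu: "8"
          memory: 32Gi
```

With requests set, a pending train pod triggers a scale-up of the dedicated group, and the cluster-autoscaler removes the node once the job finishes.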

Finally, for the serving part, we are still not using the integrated KFServing or Seldon. At the moment, the component only deploys classic Kubernetes YAML files of the Deployment, Service, and Ingress kinds. Whether you are using pure Kubernetes or a serving framework, again, you must call your Tech guy to bring things together: configure the load balancer, your auto-scaling, your monitoring…

Speaking of monitoring and auto-scaling: as Sagemaker is a managed service, it manages that kind of thing for you. For example, you can easily auto-scale your inference API based on its response time.

On Kubernetes, you can basically use CPU or memory usage and, thankfully, external metrics. As our cluster is monitored by Datadog, it natively provides external metrics to be used in your HorizontalPodAutoscaler configuration! Yes, all metrics sent to Datadog can be used from within your cluster to trigger your pod auto-scaling (Datadog documentation).
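For illustration, an HPA backed by a Datadog external metric could look like this (the metric name and target value are hypothetical; autoscaling/v2beta1 was the current API version at the time):

```yaml
apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: inference-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: External
      external:
        # Any metric exposed by the Datadog Cluster Agent can go here.
        metricName: trace.http.request.duration
        targetAverageValue: 0.2
```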

For the Tech guy ❤️

Configure your resource requests correctly.
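For example (values are illustrative): the requests are what the scheduler and the cluster-autoscaler reason about, so under-requesting causes noisy neighbours and over-requesting wastes nodes.

```yaml
# Illustrative container resources for an inference pod.
resources:
  requests:
    cpu: 500m      # what the scheduler reserves on the node
    memory: 1Gi
  limits:
    memory: 1Gi    # hard cap; the container is OOM-killed beyond this
```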

Configure the Deployment liveness and readiness probes.
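A minimal sketch, assuming the inference server exposes hypothetical /healthz and /ready endpoints on port 8080:

```yaml
livenessProbe:            # restarts the container when it is stuck
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:           # removes the pod from the Service endpoints
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
```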

Avoid frontend API call failures during your inference pods’ rolling updates

  • Configure correct timeouts and retries between the frontend and the backend.
  • Add a preStop hook and a terminationGracePeriodSeconds in your Deployment to warn the load balancer that it must stop sending requests before the actual SIGTERM.
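These two settings can be sketched as follows (the sleep duration is illustrative and must stay below the grace period):

```yaml
spec:
  # Total time the pod gets between SIGTERM and SIGKILL.
  terminationGracePeriodSeconds: 60
  containers:
    - name: inference
      lifecycle:
        preStop:
          exec:
            # Keep serving while the load balancer deregisters the pod.
            command: ["sh", "-c", "sleep 15"]
```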

Let the HPA handle the replica count

  • Here is a great article on Medium which explains this matter (medium.com)

Increase the reactivity of your load balancer health checks

If you are using aws-alb-ingress-controller, set it to “ip” target-type mode instead of the default “instance” mode.
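With aws-alb-ingress-controller this is a single annotation; “ip” mode registers the pod IPs directly in the ALB target group instead of going through NodePorts on the instances (the resource name below is illustrative, and the apiVersion matches the Kubernetes versions of the time):

```yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: inference-api
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/target-type: ip
```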

Modify the native AWS-CNI DaemonSet to decrease the number of reserved IPs per worker.
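The AWS VPC CNI supports a WARM_IP_TARGET environment variable for this; a sketch of the change on the aws-node DaemonSet container (the value is illustrative):

```yaml
# Environment of the aws-node DaemonSet container.
env:
  - name: WARM_IP_TARGET
    value: "5"    # keep ~5 warm IPs per node instead of a whole spare ENI
```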

Be careful with “AWS cross-zone load balancing” when your workers span multiple availability zones.

  • It is always enabled with an ALB.
  • Solution (cluster-autoscaler FAQ): add --balance-similar-node-groups to cluster-autoscaler and create one ASG per AZ.
  • Or use an ELB and disable this feature.
  • Or use the lifecycle hooks of your ASGs to gracefully terminate your pods, using a drainer in a Lambda function.

Finish him!

Using classic instance types instead of “ml” types, plus reserved and spot instances, we achieved around 30–40% in cost savings.

Auto-scaling is precisely triggered using our business metrics.

Data scientists can reproduce the production workflow without limitation or complexity.

The CI/CD pipeline runs unit tests, versions containers, and publishes them to ECR.

The community around Kubeflow keeps growing and AWS is a part of it… like us.

leboncoin Engineering Blog

Learn more about creative engineers & data scientists building a French virtual Flea Market