Migrating our machine learning platform from AWS SageMaker to Kubernetes & Kubeflow
By Jérémy Smadja (DevOps/SRE)
An “expected” journey…
Approximately 33,000,000 ads are available on leboncoin.fr. Because a classic user search cannot always be matched with perfectly relevant ads during a session, it is very helpful to build a recommendation system that automatically exposes ads based on the user's ad-view history. This new machine learning project began in early 2019, while leboncoin was moving its entire infrastructure to AWS.
The first recommendation infrastructure was built around three tools:
- Apache Airflow is a well-known open-source tool for scheduling and managing workflows.
- AWS EMR is a managed service providing batch/stream processing clusters for large amounts of data.
- AWS SageMaker is a managed service whose aim is to put your machine learning source code into production. It handles all the infrastructure to train your model, as well as the serving (automatically updating the endpoint with the new model, without downtime).
To put it simply, Airflow is responsible for periodically triggering tasks such as preparing a new dataset with AWS EMR, launching a new model training on SageMaker, and asking SageMaker to expose the newly generated model online (serving).
Seems perfect, but…
As the project moved on, new features were added and the team quickly ran into some AWS SageMaker limitations:
- To let SageMaker handle things automatically, you have to respect its documented format. This is not always convenient when you want to deeply change the training process, and you also need to pack all your processes into a single container (in our case the HTTP server, the reverse proxy, the metrics agent…).
- In 2019, the generated model had to be smaller than 30 GB to fit into the available “ml” instance types serving the model (nowadays AWS offers many more instance types).
- “ml” instances are significantly more expensive (>30%) than the equivalent EC2 instance types.
Additionally, we want to help our data scientists be more effective by enabling more reproducibility, especially between their development environment and the production one: same code, same container, same power, same size…
Here comes a new challenger!
Our stack was already fully containerized, so our data scientists and data engineers were already able to reproduce the production workflow locally using docker-compose. This is still used and very convenient for local development. Thanks to that, at the end of 2019 we already planned to convert the docker-compose file to Kubernetes manifests and move away from SageMaker. Fortunately there is a great open-source project called Kubeflow, specifically designed for machine learning workflows on Kubernetes \o/.
Kubeflow is presented as a big toolbox aggregating multiple pieces of software (open-source or self-developed projects) to facilitate the work of data scientists:
- Pipelines built on Argo Workflows, where each step is a “component”, i.e. a pod/container running in your Kubernetes cluster.
- A Python SDK allowing you, for example, to turn any function into a component to be used in a pipeline.
- Hyperparameter tuning using Katib.
- Serving with KFServing, Seldon or TensorFlow Serving.
- A lot of integrations for common machine learning frameworks: TensorFlow, PyTorch, MPI…
- Metadata storage and visualization.
- Jupyter Notebook servers…
Well, it all seems good, but as explained earlier, it runs on Kubernetes (AWS EKS in our case), so don't think you can easily jump into Kubeflow, create a pipeline and expose your model just like that.
There are a lot of infrastructure considerations under the hood, and consequently you will likely need an admin/DevOps/data engineer to make things work. If you don't have that resource, maybe you should consider using a managed service like AWS SageMaker or AWS Personalize.
The main pipeline
For our first steps on Kubeflow, we decided to migrate with a “lift and shift” approach, meaning that we reuse all our current containers and simply plug them into a Kubeflow pipeline. Looking at the previous workflow, here is the representation of the pipeline on Kubeflow.
The prepare part uses three components: create_emr_cluster, submit_emr_job and delete_emr_cluster. Something great about the component concept is that you can easily reuse components in any pipeline, meaning that our data scientists can create an EMR cluster without knowing all the configuration details, and it is the same configuration as the production environment. Even better, we did not have to create these components from scratch, because the community had already done it (GitHub).
As the training process was already a container used by SageMaker, we had to remove some SageMaker-specific code and handle the S3 download and upload of the model ourselves. Next, we had to create a Kubeflow component YAML file (an interface between the container and the Python SDK operator) to include the train step in our pipeline.
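For illustration, such a component file might look like the following minimal sketch (the component name, image, command and argument names here are hypothetical, not our actual ones):

```yaml
# train_component.yaml: declares the container interface for the train step
name: train
description: Train the recommendation model and export it to S3
inputs:
  - {name: dataset_s3_uri, type: String}
outputs:
  - {name: model_s3_uri, type: String}
implementation:
  container:
    image: <account>.dkr.ecr.eu-west-1.amazonaws.com/reco-train:latest
    command: [python, train.py]
    args:
      - --dataset
      - {inputValue: dataset_s3_uri}
      - --model-uri-output
      - {outputPath: model_s3_uri}
```

The Python SDK can then load this file with kfp.components.load_component_from_file() and use the resulting op as a step in a pipeline.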
The prepare step runs our code outside of our Kubernetes cluster, on AWS EMR. So the K8s worker that runs your components (pods) must have the rights to manipulate AWS EMR (create, submit, delete).
The train component is a pod running inside our Kubernetes cluster, using one specific powerful and pricey instance type. So you must ensure your cluster creates these instances only when necessary, schedules your training pods on these workers (and not on the workers hosting Kubeflow pods), and scales the instances down when the job is done (hello my dear Tech friend 🙂).
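One common way to achieve this (a sketch, assuming a dedicated, tainted node group reserved for training; the label, taint and image names are made up) is to combine a nodeSelector, a toleration, and resource requests sized so that the cluster-autoscaler spins the instance up on demand:

```yaml
# Pod spec fragment for the train component
spec:
  nodeSelector:
    workload: training              # label on the dedicated (pricey) node group
  tolerations:
    - key: dedicated                # taint keeping other pods off these nodes
      operator: Equal
      value: training
      effect: NoSchedule
  containers:
    - name: train
      image: reco-train:latest      # hypothetical training image
      resources:
        requests:                   # sized to claim almost the whole worker,
          cpu: "15"                 # forcing the autoscaler to add one
          memory: 110Gi
```

With the Kubeflow Python SDK, the same constraints can be attached to an op (for example via add_toleration and add_node_selector_constraint in kfp v1).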
Finally, for the serving part, we are still not using the integrated KFServing or Seldon. For the moment, the component only deploys classic Kubernetes YAML manifests: Deployment, Service and Ingress kinds. Whether you are using pure Kubernetes or a serving framework, again, you must call your Tech guy to bring things together: configure the load balancer, the auto-scaling, the monitoring…
Speaking of monitoring and auto-scaling: as SageMaker is a managed service, it manages that kind of stuff for you. For example, you can easily auto-scale your inference API based on its response time.
On Kubernetes you can basically use CPU or memory usage and, thankfully, external metrics. As our cluster is monitored by Datadog, it natively provides external metrics to be used in your HorizontalPodAutoscaler configuration! Yes, any metric sent to Datadog can be used from within your cluster to trigger pod auto-scaling (Datadog documentation).
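As a sketch, an HPA driven by a Datadog external metric could look like this (assuming the Datadog Cluster Agent is registered as the cluster's external metrics provider; the names, labels and threshold are illustrative):

```yaml
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: External
      external:
        metric:
          name: trace.http.request.duration   # any metric known to Datadog
          selector:
            matchLabels:
              service: inference-api
        target:
          type: Value
          value: "200m"                       # scale out above ~0.2 s
```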
For the Tech guy ❤️
Here are some considerations when migrating your classic machine learning containers to a Kubernetes environment, especially if you want to enable auto-scaling:
Configure your resource requests correctly.
- Sometimes your data scientists need an entire worker for their training process.
Configure the Deployment liveness and readiness probes.
- If the inference pod takes several minutes to be ready, be sure to delay those probes to avoid unnecessary restarts.
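A Deployment fragment along those lines might look like this (image, ports, paths and values are illustrative):

```yaml
containers:
  - name: inference
    image: reco-inference:latest   # hypothetical inference image
    resources:
      requests:                    # what the scheduler reserves on the worker
        cpu: "2"
        memory: 4Gi
      limits:
        memory: 4Gi
    readinessProbe:                # receive no traffic before the model is loaded
      httpGet: {path: /ready, port: 8080}
      initialDelaySeconds: 120
      periodSeconds: 10
    livenessProbe:                 # avoid restarting a pod that is still loading
      httpGet: {path: /health, port: 8080}
      initialDelaySeconds: 300
      failureThreshold: 6
```

On recent Kubernetes versions, a startupProbe is a cleaner way to cover a slow start than large initial delays.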
Avoid frontend API call failures during your inference pods' rolling updates
- Gracefully stop containers by implementing proper SIGTERM handling, either with an init container or in the program itself.
- Configure correct timeouts and retries between frontend and backend.
- Add a preStop task and a terminationGracePeriodSeconds in your Deployment to warn the load balancer that it must stop sending requests before the actual SIGTERM.
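In practice this can be sketched as follows (the sleep duration and grace period are illustrative and depend on your load balancer's deregistration delay):

```yaml
spec:
  terminationGracePeriodSeconds: 60   # total time allowed for preStop + shutdown
  containers:
    - name: inference
      lifecycle:
        preStop:
          exec:
            # Keep serving while the load balancer deregisters the pod;
            # only afterwards does the container receive SIGTERM and drain.
            command: ["sh", "-c", "sleep 20"]
```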
Let the HPA handle the replica count
- Do not set the replicas field in the Deployment file, or it will force the replica count each time you trigger a rolling update.
- Here is a great article on Medium explaining this matter (medium.com).
Increase the reactivity of your load balancer health checks
If you are using aws-alb-ingress-controller, set it to “IP” target mode instead of the default “instance” mode
- It decreases latency (only one hop) and avoids routing traffic to irrelevant workers = pod location awareness :o
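With aws-alb-ingress-controller, both the target mode and the health-check reactivity are driven by Ingress annotations; here is a sketch (the Ingress name and values are illustrative):

```yaml
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: inference-api
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/target-type: ip                    # register pod IPs directly
    alb.ingress.kubernetes.io/healthcheck-interval-seconds: "10" # faster checks
    alb.ingress.kubernetes.io/healthy-threshold-count: "2"
```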
Modify the native AWS-CNI DaemonSet to decrease the number of reserved IPs per worker.
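The warm pool of IPs is controlled by environment variables on the aws-node DaemonSet in kube-system; for example (the values are illustrative, see the amazon-vpc-cni-k8s documentation):

```yaml
# Fragment of the aws-node DaemonSet container spec
env:
  - name: WARM_IP_TARGET       # number of free IPs kept ready per worker
    value: "2"
  - name: MINIMUM_IP_TARGET    # floor of IPs allocated per worker
    value: "10"
```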
Be careful with the “AWS cross-zone load balancing” when your workers are in multiple availability zones.
- AWS can suddenly shut down a worker to rebalance instances between AZs. Kubernetes is not aware of that behavior; it is like losing an actual worker!
- It is always enabled with an ALB.
- Solution (cluster-autoscaler FAQ): add --balance-similar-node-groups to cluster-autoscaler and create one ASG per AZ.
- Or use an ELB and disable this feature.
- Or use the lifecycle hooks in your ASG to gracefully terminate your pods with a drainer in a Lambda function.
Well, there is still a lot of work to accomplish, such as serving and A/B testing, but this first iteration is now in production and we are proud of it.
By using classic instance types instead of “ml” types, along with reserved and spot instances, we achieved around 30–40% cost savings.
Auto-scaling is precisely triggered using our business metrics.
Data scientists can reproduce the production workflow without limitation or complexity.
The CI/CD runs unit tests, versions containers and publishes them to ECR.
The community around Kubeflow keeps growing and AWS is a part of it… like us.