Machine learning (ML) is the application of artificial intelligence (AI) through a family of algorithms that gives systems the ability to automatically learn and improve from experience without being explicitly programmed. These algorithms can approximate both linear and non-linear relationships, extracting more information from the data to achieve higher accuracy.
Python is the reference language for ML, and many libraries provide off-the-shelf machine learning algorithms, such as scikit-learn, XGBoost, LightGBM and TensorFlow. For these reasons, Python is dunnhumby's language of choice for Data Scientists.
However, over the years, we’ve found a few inconveniences with ML:
- You don’t know in advance which algorithm will give the best results, so you have to try several
- The learning process of a model is controlled through hyperparameters that the user has to tune. There is no formula to calculate them, so finding good values can require hundreds of iterations for each model
- Their flexible framework puts them at risk of overfitting, which is when the model starts to learn the noise in the data. This can be prevented using cross-validation, during which models are trained and evaluated several times on different subsets of the data
- Most of them require a significant amount of CPU, RAM and sometimes GPU to run efficiently
Long story short, between the number of algorithms available, their hyperparameters and the cross-validation, a data scientist might have to create thousands of models before reaching a good outcome. With 400 analysts and data scientists, we needed a solid platform that manages our resources efficiently and lets them use ML in their work, the way they want, in a short timeframe.
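As an illustration of the cross-validation step mentioned above, here is a minimal sketch with scikit-learn (the synthetic dataset and the model are just examples):

```python
# A minimal sketch of 5-fold cross-validation with scikit-learn.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data purely for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

# Each fold trains on 4/5 of the data and evaluates on the held-out 1/5,
# so consistently high scores suggest the model is not memorising noise.
scores = cross_val_score(model, X, y, cv=5)
print(f"fold accuracies: {scores.round(3)}, mean: {scores.mean():.3f}")
```

Note that this alone already multiplies the work by five: every candidate model is fitted once per fold.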
Parallel processing is the opposite of sequential processing. By splitting a job into independent tasks and executing them simultaneously, a significant boost in performance can be achieved.
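The idea can be sketched in a few lines of Python; here the "model fits" are stand-ins (a short sleep), but the speed-up pattern is the same:

```python
# A toy illustration of splitting a job into independent tasks and
# executing them simultaneously, versus one after another.
import time
from concurrent.futures import ThreadPoolExecutor

def fit_one(params):
    """Stand-in for fitting one model; the sleep mimics compute time."""
    time.sleep(0.2)
    return params["depth"], "fitted"

tasks = [{"model": "xgb", "depth": d} for d in range(4)]

# Parallel: all four tasks run at the same time.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fit_one, tasks))
parallel_time = time.perf_counter() - start

# Sequential baseline: the same tasks one after another.
start = time.perf_counter()
for t in tasks:
    fit_one(t)
sequential_time = time.perf_counter() - start

print(f"parallel: {parallel_time:.2f}s vs sequential: {sequential_time:.2f}s")
```

With four workers the parallel run takes roughly a quarter of the sequential time; real model fits are CPU-bound, which is why we run them in separate containers rather than threads.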
The Kubeflow project is meant to make deployments of machine learning workflows on Kubernetes simple, portable and scalable. However, at the time of writing it is still at an early stage, and not all of the open-source libraries we use at dunnhumby, such as scikit-learn, are supported.
To achieve our objective, we needed a platform that allows:
- To run any ML algorithm of our choice in parallel
- Large datasets to be processed
- Multi-tenancy, so different users can request ML pipelines simultaneously
- An efficient resource management system for the cluster
To run all the ML models in parallel, we started by creating a Docker image that contains all the ML libraries we use at dunnhumby. The resulting container has a very simple role: given the location of the data to use, the model name, and the hyperparameters, it fits the model on the data and returns the predictions.
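The entrypoint of such a container could look roughly like the sketch below; the model registry, file layout and script name are illustrative, not our actual implementation:

```python
# A minimal sketch of a container entrypoint: given a data location, a
# model name and hyperparameters, fit the model and return predictions.
import json
import pickle
import sys

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Illustrative registry mapping model names to sklearn estimators.
MODELS = {
    "logistic_regression": LogisticRegression,
    "random_forest": RandomForestClassifier,
}

def run(data_path, model_name, hyperparams):
    # The job's data is expected as a pickled (X_train, y_train, X_test) tuple.
    with open(data_path, "rb") as f:
        X_train, y_train, X_test = pickle.load(f)
    model = MODELS[model_name](**hyperparams)
    model.fit(X_train, y_train)
    return model.predict(X_test)

if __name__ == "__main__" and len(sys.argv) >= 4:
    # e.g. python entrypoint.py /data/job.pkl random_forest '{"n_estimators": 100}'
    data_path, model_name, raw_params = sys.argv[1:4]
    print(run(data_path, model_name, json.loads(raw_params)).tolist())
```

Because each container does exactly one fit, the platform can launch as many of them as the grid search requires.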
Below is a very simple example of a Dockerfile that can be used to create a similar image:
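A sketch along those lines, assuming a slim official Python base image and an illustrative entrypoint script name:

```dockerfile
# Start from a slim official Python base image.
FROM python:3.9-slim

# Install the ML libraries the containers need.
RUN pip install --no-cache-dir scikit-learn xgboost lightgbm

# Copy the fitting script into the image (the name is illustrative).
WORKDIR /app
COPY entrypoint.py /app/entrypoint.py

# The container fits one model per run: the data location, model name and
# hyperparameters are passed as arguments at launch time.
ENTRYPOINT ["python", "entrypoint.py"]
```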
We chose Kubernetes to build our cluster, which is an “open-source container-orchestration system for automating deployment, scaling and management of containerized applications”.
Containerisation is an alternative to full virtualisation: instead of each application running in a virtual machine with its own operating system, a container packages the application with its dependencies while sharing the host's kernel. This enables developers to run the application in any environment without having to worry about dependencies. By containerising the fitting of each model, one can run an entire ML grid search in parallel on Kubernetes, as the containers are independent from each other.
The ability to run models in parallel results in a significant boost in terms of performance compared to a sequential approach, and also allows us to manage resources more efficiently. Kubernetes lets us allocate a different amount of memory and CPU to each container, which means we can give more resources to complex algorithms like XGBoost that rely on multi-threading than to single-threaded algorithms like LogisticRegression from sklearn. This allows us to make the most of our resources, keep costs down, and run more pipelines simultaneously.
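As a rough illustration, the pod spec for one model fit might declare its resources like this (the names, image and values are examples, not our actual configuration); a single-threaded LogisticRegression fit would request `cpu: "1"` instead:

```yaml
# Illustrative Kubernetes pod for one multi-threaded XGBoost fit.
apiVersion: v1
kind: Pod
metadata:
  name: fit-xgboost
spec:
  restartPolicy: Never
  containers:
    - name: model-fit
      image: ml-runner:latest            # the image described above
      args: ["/data/job.pkl", "xgboost", '{"max_depth": 6}']
      resources:
        requests:
          cpu: "4"                       # multi-threading benefits from several cores
          memory: "8Gi"
        limits:
          cpu: "4"
          memory: "8Gi"
```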
Due to our large number of users, our platform needs to be multi-tenant. When a user wants to run many ML models, the job is sent to a broker that puts it in a queue. This broker is in charge of managing the requests as well as returning the results of the pipeline to the user.
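A toy sketch of the idea: jobs from different users go into a single queue, and results are routed back to whoever submitted them (the dispatch step here is a stand-in for launching the pipeline on the cluster):

```python
# A toy sketch of a multi-tenant broker backed by a FIFO queue.
import queue

job_queue = queue.Queue()

def submit(user, pipeline):
    """Enqueue a pipeline request on behalf of a user."""
    job_queue.put({"user": user, "pipeline": pipeline})

def process_next(results):
    """Pop the oldest job, run it, and record the result under its owner."""
    job = job_queue.get()
    # Stand-in for dispatching the pipeline to the cluster.
    results.setdefault(job["user"], []).append(f"done: {job['pipeline']}")

results = {}
submit("alice", "gridsearch-xgb")
submit("bob", "gridsearch-rf")
process_next(results)
process_next(results)
print(results)
```

The real broker is a separate service, but the contract is the same: requests are accepted independently of cluster load, and each user only ever sees their own results.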
We work with large datasets, so we needed an efficient way to allow containers to access the data. We also take data privacy extremely seriously, so we had to make sure that the process was secure.
When a job is created, the datasets are serialised as pickle files and passed through RPC calls to a shared location. Only the containers created for that particular job can access this location.
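A minimal sketch of the serialisation step, using a per-job directory whose permissions are restricted to its owner (the path layout is illustrative):

```python
# Serialise a job's datasets to a per-job location as a pickle file.
import os
import pickle
import tempfile

def write_job_data(job_id, X_train, y_train, X_test, root=None):
    """Pickle the datasets into a directory readable only by the owner."""
    root = root or tempfile.gettempdir()
    job_dir = os.path.join(root, f"job-{job_id}")
    os.makedirs(job_dir, mode=0o700, exist_ok=True)  # owner-only access
    path = os.path.join(job_dir, "data.pkl")
    with open(path, "wb") as f:
        pickle.dump((X_train, y_train, X_test), f)
    return path

path = write_job_data("42", [[0], [1]], [0, 1], [[2]])
with open(path, "rb") as f:
    X_train, y_train, X_test = pickle.load(f)
print(X_test)  # [[2]]
```

In production the shared location is network storage rather than a temp directory, but the principle is the same: one isolated location per job.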
To control and manage the creation of Docker containers on the Kubernetes cluster and the appropriate allocation of resources for each model, we created a scheduler component. Its job is to ensure that all the models are properly created and to monitor the health of the cluster. If an error occurs, the scheduler is responsible for returning the error and logs to the user, and for cleaning up.
The Final Result
We benchmarked our platform against an ML pipeline previously created by a data scientist. She wanted to try 5 XGBoosts, 5 RandomForests and 5 L1-regularised Logistic Regressions (all sklearn), with 5-fold cross-validation. That is equivalent to creating 75 models, and it took almost 8 hours in total when run sequentially. With our new platform, the same pipeline took less than an hour: that's 8 times faster!
We have now deployed this platform and made it available to all our analysts and data scientists, and it can run both on-premise and in the cloud.
The Next Steps
One issue that remains is the appropriate provisioning of resources for each container. Some models are single-threaded and only require 1 CPU, but others are multi-threaded and need more CPUs if the process is to complete in a short time frame. The amount of RAM required also depends on the size of the dataset.
The simplest approach is to always allocate the same resources, regardless of the size of the dataset and the model. However, this results in a misuse of our resources: jobs that need very little will be given more than necessary, while jobs requiring many CPUs and lots of RAM won't be given enough.
Tailoring the number of CPUs and the amount of RAM for each container based on both the ML algorithm and the size of the dataset is not trivial. We are currently working on a new component for our stack that relies on ML to predict not only the amount of resources that should be allocated to a job, but also the probability of getting an error and the expected runtime to completion. As of now, only the latter has been implemented, meaning we can successfully predict the runtime of a pipeline with a precision of ±30s.