Serverless machine learning using Docker
Running containers in Google AI Platform
For this post, I assume that the reader knows what Docker is. If you don’t know anything about Docker, you can start here.
Why use Docker in Data Science
Docker is a great tool for building, shipping, and running containers. It allows us to run applications on different infrastructure (for example AWS, GCP, Kubernetes, etc.) without much effort. You can run and test your applications locally and then deploy and scale them with ease.
Usually, a Data Science project follows a general workflow: extract and explore the data, then train and test multiple models until you get the expected results. After that, it’s a good idea to package the model training process so that you can easily retrain the model or scale the process up.
This is where Docker can make the process simpler. The main idea of this post is to show how to package a model’s training process in a container and run it on Google Cloud Platform.
Prepare the data
At Spike, we usually feed our models’ training processes with data hosted in Google BigQuery, a powerful data warehouse that lets us manipulate very large amounts of data quickly and easily.
For this post, I uploaded the wine quality dataset into BigQuery. This dataset includes a series of physicochemical features (acidity, residual sugar, pH, alcohol, and so on) that can (hopefully) determine the quality of a wine.
Our goal will be to train a model that predicts wine quality. For practical reasons we won’t focus on the model’s performance in this post, so I’ll skip that evaluation step.
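As a quick reference, this is roughly how the uploaded table can be queried from Python using pandas-gbq. The project id and the dataset/table names (wine.quality) are placeholders, not the actual ones used in the post:

import pandas as pd

# Query the wine quality table from BigQuery (placeholder names);
# this uses your default Google Cloud credentials.
df = pd.read_gbq(
    "SELECT * FROM `my-project.wine.quality`",
    project_id="my-project",
)
print(df.head())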
Create a service account
The next step is to create a service account and assign it the roles required to read BigQuery tables. In production, this lets us restrict the service account’s access to specific datasets in the BigQuery project. For each service account, a JSON key file is created, which we then need to download.
The previous step is only needed when you extract data from BigQuery. If you use a different data warehouse, you will need to set up the corresponding connection to load your data.
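Loading the credentials in the training code is straightforward. A minimal sketch, assuming the key file is saved as service_account.json next to the script:

from google.oauth2 import service_account

# Build credentials from the downloaded key file (file name is an assumption)
credentials = service_account.Credentials.from_service_account_file(
    "service_account.json"
)
# The credentials object can then be passed to pandas-gbq, e.g.
# pd.read_gbq(query, project_id="my-project", credentials=credentials)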
Training our model
For this example, our training script will load the data from BigQuery, train a GradientBoostingRegressor using parameters defined by the user, and finally log some model metrics.
In a real ML problem we could also run a grid search for parameter tuning, export the model to Google Cloud Storage, and so on.
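To make the example concrete, here is a minimal sketch of what main.py could look like. The table name (wine.quality), the target column and the metric are assumptions, not the post’s exact code:

import argparse

import pandas as pd
from google.oauth2 import service_account
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def main(args):
    # Authenticate with the service account key copied into the image
    credentials = service_account.Credentials.from_service_account_file(
        "service_account.json"
    )

    # Load the wine quality table from BigQuery (placeholder table name)
    df = pd.read_gbq(
        "SELECT * FROM `{}.wine.quality`".format(args.project_id),
        project_id=args.project_id,
        credentials=credentials,
    )

    X = df.drop(columns=["quality"])
    y = df["quality"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # Train the model with the user-defined parameters
    model = GradientBoostingRegressor(
        n_estimators=args.ntrees,
        learning_rate=args.learning_rate,
        subsample=args.subsample,
        max_depth=args.max_depth,
    )
    model.fit(X_train, y_train)

    # Log a simple model metric
    mse = mean_squared_error(y_test, model.predict(X_test))
    print("Test MSE: {:.4f}".format(mse))


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--ntrees", type=int, default=100)
    parser.add_argument("--learning_rate", type=float, default=0.1)
    parser.add_argument("--subsample", type=float, default=1.0)
    parser.add_argument("--max_depth", type=int, default=3)
    parser.add_argument("--project_id", type=str, required=True)
    main(parser.parse_args())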
Building the Docker image
Let’s start by adding a requirements.txt file with the required dependencies.
pandas
pandas-gbq
scikit-learn
google-cloud-bigquery-storage
google-api-python-client
fastavro
tqdm
Now let’s add a Dockerfile:
#base image
FROM python:3

#copy main.py, requirements.txt and service_account.json into trainer folder
COPY . /trainer

#set the trainer folder as working directory
WORKDIR /trainer

#install dependencies using pip
RUN pip --no-cache-dir install -r requirements.txt

#define our image entrypoint
ENTRYPOINT ["python", "main.py"]
The file structure should look like this:
docker_example/
├── Dockerfile
├── main.py
├── requirements.txt
└── service_account.json
Now we can build our Docker image by running:
docker build -t docker-model-training .
After a few minutes the image will be built, and if everything is ok a message like this will be printed:
Successfully built abfd807885a6
Successfully tagged docker-model-training:latest
Now we can run it locally:
docker run docker-model-training:latest --ntrees 100 \
--learning_rate 0.01 \
--subsample 0.8 \
--max_depth 5 \
--project_id project
If everything works, the container will run the training script and print the metrics we logged in main.py.
Running the container in Google Cloud Platform
We are going to train our model in Google AI Platform, which lets us train models without worrying about managing servers or clusters. We only need to define the machine type and submit a job with our training code, and Google does the rest. Perfect!
First, we push our image to Google Container Registry:
docker build -t gcr.io/[PROJECT_ID]/docker-model-training .
docker push gcr.io/[PROJECT_ID]/docker-model-training
Now we can submit a training job to AI Platform with the following command:
gcloud ai-platform jobs submit training [JOB_NAME] \
--region us-east1 \
--master-image-uri gcr.io/[PROJECT_ID]/docker-model-training \
-- \
--ntrees 100 \
--learning_rate 0.01 \
--subsample 0.8 \
--max_depth 5 \
--project_id [GCP_PROJECT_ID]
The model’s logs and outputs can be found in Stackdriver Logging, or alternatively we can stream them to the console with gcloud ai-platform jobs stream-logs [JOB_NAME].
Conclusions
Training ML models using Docker makes our training scripts portable and easy to scale. We can run them locally or on AWS, Azure, GCP, Kubernetes, etc.
In Google Cloud Platform you can easily submit a training job and switch between different machine types. The kind of machine depends on your problem and the size of your data. Additionally, you can run your training script on a GPU or TPU.