Training and deploying machine learning models on GCP ML-Engine using TensorFlow Estimators

What is Google Cloud ML-Engine?


Google Cloud ML-Engine is a managed service offered by Google for training machine learning models and deploying them in the cloud for serving. It enables developers and data scientists to build their models and bring them to production, offering both training and prediction services. Training and Online Prediction let you use multiple ML frameworks and seamlessly deploy models into production, with no Docker container required. You can also import models that have been trained anywhere.

1. Preparing the Training Package

Part 1 → Getting the dataset

The dataset I am using here is Kaggle’s House Price Prediction dataset, which you can download from Kaggle.

Part 2 → Creating the python package

The file tree below shows the files we need to create.
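This is a typical layout for an ML-Engine training package; the empty __init__.py simply marks trainer as an importable Python package:

.
├── setup.py
└── trainer/
    ├── __init__.py
    ├── task.py
    └── model.py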
  1. setup.py

When you run a training job on Cloud ML-Engine, the job runs on training instances that already have many common Python packages installed, called standard packages. If your code uses custom packages that are not pre-installed, you need to declare them somewhere so that they can be installed on the instance. You declare them in setup.py. A simple setup.py file will look like this.
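Here is a minimal sketch of such a setup.py; the name, version, and description are placeholders, and REQUIRED_PACKAGES would list whatever custom packages your trainer needs:

# Minimal setup.py sketch; name, version, and description are placeholders.
from setuptools import find_packages, setup

# List any custom packages that are not pre-installed on the training
# instances, e.g. ['pandas==0.23.4'].
REQUIRED_PACKAGES = []

setup(
    name='trainer',
    version='0.1',
    packages=find_packages(),
    include_package_data=True,
    install_requires=REQUIRED_PACKAGES,
    description='Trainer package for the housing price model.'
)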

2. task.py

  1. Reads and parses model parameters, such as the location of the training data, the output model directory, the number of hidden layers, the batch size, etc.
  2. Loads the data from the specified location and applies the preprocessing logic.
  3. Calls the model training logic located in model.py with those parameters.

I have defined four functions in this file.

  • download_files_from_gcs → downloads the data files from GCS.
  • get_args → defines all the arguments required from the user, such as --job-dir, --train-file, etc.
  • load_data → first downloads the data from GCS, then applies all the preprocessing; it returns the train and test data.
  • train_and_evaluate → defines the estimator specs and the exporter information, and finally calls tf.estimator.train_and_evaluate to start the training job. A condensed sketch of the last two functions follows.
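This sketch assumes the argument names above and that load_data behaves as described; the defaults, max_steps, and the input_fn signature are illustrative assumptions:

# Condensed sketch of task.py. download_files_from_gcs and load_data are
# defined earlier in this file as described above (omitted here).
import argparse

import tensorflow as tf

from trainer import model


def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument('--train-file', required=True)
    parser.add_argument('--job-dir', required=True)
    parser.add_argument('--batch-size', type=int, default=128)
    return parser.parse_args()


def train_and_evaluate(args):
    x_train, y_train, x_test, y_test = load_data(args.train_file)
    estimator = model.keras_estimator(model_dir=args.job_dir)
    train_spec = tf.estimator.TrainSpec(
        input_fn=lambda: model.input_fn(
            x_train, y_train, args.batch_size,
            mode=tf.estimator.ModeKeys.TRAIN),
        max_steps=1000)
    # The exporter writes the SavedModel under <job-dir>/export/exporter/.
    exporter = tf.estimator.LatestExporter('exporter',
                                           model.serving_input_fn)
    eval_spec = tf.estimator.EvalSpec(
        input_fn=lambda: model.input_fn(
            x_test, y_test, args.batch_size,
            mode=tf.estimator.ModeKeys.EVAL),
        exporters=[exporter])
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)


if __name__ == '__main__':
    train_and_evaluate(get_args())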

3. model.py

In this file, a minimum of three functions need to be defined (a sketch follows the list). They are →

  1. model_fn() or keras_estimator(). For the sake of this example, I am using keras_estimator(). Here you will define your complete model architecture. The function needs to return an Estimator instance built from your compiled Keras model.
  2. input_fn(), which is used for passing input to your model. We will use TensorFlow’s Dataset API to build the input pipeline; the function returns a tf.data.Dataset from which the Estimator draws batches.
  3. serving_input_fn(), which defines the features to be passed to the model during inference, for example TensorFlow placeholders. It takes no arguments and returns a tf.estimator.export.ServingInputReceiver.
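A minimal sketch of model.py follows; the architecture, NUM_FEATURES, and the dense_input feature key (which must match the name of the Keras model's input) are illustrative assumptions:

# Minimal sketch of model.py. NUM_FEATURES, the layer sizes, and the
# 'dense_input' key are assumptions for illustration.
import tensorflow as tf

NUM_FEATURES = 10  # placeholder: number of preprocessed input features


def keras_estimator(model_dir, learning_rate=0.01):
    """Builds a compiled Keras model and wraps it in an Estimator."""
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation='relu',
                              input_shape=(NUM_FEATURES,)),
        tf.keras.layers.Dense(32, activation='relu'),
        tf.keras.layers.Dense(1)  # single regression output: price
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate),
                  loss='mse', metrics=['mae'])
    return tf.keras.estimator.model_to_estimator(keras_model=model,
                                                 model_dir=model_dir)


def input_fn(features, labels, batch_size, mode):
    """Creates the tf.data pipeline for training or evaluation."""
    dataset = tf.data.Dataset.from_tensor_slices(
        ({'dense_input': features}, labels))
    if mode == tf.estimator.ModeKeys.TRAIN:
        dataset = dataset.shuffle(buffer_size=1000).repeat()
    return dataset.batch(batch_size)


def serving_input_fn():
    """Defines the placeholder the deployed model receives at inference."""
    inputs = tf.placeholder(tf.float32, shape=[None, NUM_FEATURES])
    return tf.estimator.export.ServingInputReceiver(
        features={'dense_input': inputs},
        receiver_tensors={'input': inputs})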

Your Python package is now ready. Next, let’s create a training job.

2. Training

I. Training Locally

Before actually submitting the training job to the cloud, you can test your package locally on a dummy dataset, check for any errors, and debug them. To start a training job on your local machine, you can use either Python or the gcloud command-line tool.

a. Using Python
$ export JOB_DIR=/path/to/the/dir/
$ rm -rf $JOB_DIR
$ export TRAIN_FILE=/path/to/training/file
# TRAIN_FILE can be either a local file path or a GCS location.
$ python -m trainer.task \
--train-file=$TRAIN_FILE \
--job-dir=$JOB_DIR
b. Using the gcloud command-line tool
$ gcloud ml-engine local train --module-name=trainer.task \
--package-path=trainer \
--train-file=$TRAIN_FILE \
--job-dir=$JOB_DIR

II. Submitting the job to the cloud

If everything works fine, you can submit the training job to Cloud ML-Engine. You will first need to move your training data to a GCS bucket. After the job is submitted, ML-Engine starts training your model on the dataset you provided and saves the model checkpoints and the TensorFlow SavedModel to GCS_JOB_DIR, which you can then use to deploy the model for serving purposes. To submit a training job to the cloud, first set the following environment variables.

#ENVIRONMENT VARIABLES
$ export JOB_NAME=housing_job_1
$ export GCS_JOB_DIR=gs://cloud-ml-job-bucket
$ export TRAIN_FILE=gs://cloud-ml-data-storage-bucket/kaggle_housing_prices.csv
$ export REGION=us-central1
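
Then submit the job. A typical invocation looks like the following; the runtime version here is an assumption and should match the TensorFlow version your trainer targets:

$ gcloud ml-engine jobs submit training $JOB_NAME \
--job-dir $GCS_JOB_DIR \
--runtime-version 1.12 \
--module-name trainer.task \
--package-path trainer/ \
--region $REGION \
-- \
--train-file $TRAIN_FILE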

You can check the status of the job and its logs on the ML-Engine dashboard. When the job is finished, you can check the GCS_JOB_DIR bucket for the model checkpoints and the SavedModel, which is written to bucket/export/exporter/{timestamp}/. The folder will contain something like this.
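A typical SavedModel export directory contains the serialized graph and the variable checkpoints:

export/exporter/1545905371/
├── saved_model.pb
└── variables/
    ├── variables.data-00000-of-00001
    └── variables.index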

3. Deploying

Now that your model is successfully trained, it is time to deploy it to production so that other people can use it.

For deploying your model, you will need to follow these two steps.

Note: All the steps defined below can also be done directly from the ML-Engine dashboard.

Step 1 → Creating the model
#ENVIRONMENT VARIABLES
$ export MODEL_NAME=kaggle_housing_price_prediction
$ export MODEL_PATH=gs://cloud-ml-job-bucket/export/exporter/1545905371
#CREATE MODEL
$ gcloud ml-engine models create $MODEL_NAME
Step 2 → Creating the model version
$ gcloud ml-engine versions create "version_1" --model $MODEL_NAME \
--origin $MODEL_PATH
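
If you trained against a specific TensorFlow version, you can also pass --runtime-version (for example, --runtime-version 1.12, matching the training runtime) to the versions create command so that the serving runtime matches the one used for training.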

You have now successfully deployed your model on ML-Engine.

4. Predictions

For serving predictions, you will need to prepare your data and export it to a JSON file so that it can be sent to the deployed model. Note that --json-instances expects one JSON instance per line.

import json

with open('test_data.json', 'w') as outfile:
    for instance in test:  # one JSON instance per line
        outfile.write(json.dumps(instance) + '\n')

Now run this command to make predictions

$ gcloud ml-engine predict --model kaggle_housing_price_prediction \
--version version_1 --json-instances test_data.json
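
Alternatively, you can request online predictions from Python via the Google API client library; PROJECT_ID and instances below are placeholders you would fill in:

from googleapiclient import discovery

# PROJECT_ID and instances are placeholders; instances is a list of
# input records in the same format as the lines of test_data.json.
PROJECT_ID = 'my-gcp-project'
instances = [{'dense_input': [0.0] * 10}]  # example placeholder record

service = discovery.build('ml', 'v1')
name = 'projects/{}/models/{}/versions/{}'.format(
    PROJECT_ID, 'kaggle_housing_price_prediction', 'version_1')
response = service.projects().predict(
    name=name, body={'instances': instances}).execute()
print(response['predictions'])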

Wrapping Up

I have explained the steps to define, train, and deploy custom machine learning models on Cloud ML-Engine. Overall, the steps are pretty straightforward, and I find GCP ML-Engine surprisingly accessible and flexible. Everything, from loading the dataset and preprocessing the data to training the model and exporting the trained model, happens on ML-Engine, which auto-scales based on the resources required. No matter how heavy the preprocessing computations are, or how big the model or dataset is, it is all completely managed by ML-Engine.
