Machine Learning using Google Cloud ML Engine

Gautam Karmakar
Sep 23, 2018

Introduction: This document provides an introductory, end-to-end walkthrough of training a model and serving both batch and online predictions using Google Cloud ML Engine. After this walkthrough it will be clear how to do the following:

· Create a TensorFlow training application and validate it locally

· Run a training job on a single Cloud ML Engine instance (the BASIC scale tier)

· Use the TensorBoard web UI to visualize model metrics

· Run the same training job on distributed machines for scale

· Deploy the model in a production setting

· Request an online prediction and see the response

· Request a batch prediction

Before that, a brief overview of ML Engine in GCP. It is a fully managed service from Google Cloud that lets machine learning engineers focus on model development rather than the complexity of infrastructure and configuration. It enables seamless model training and prediction in a cost-effective way on Google's scalable, highly available cloud infrastructure. It supports TensorFlow models, and support for scikit-learn, XGBoost, and Keras models is coming soon.

Cost: in case you would like to experiment with this on your own, you need to create a Google Cloud account using your Gmail ID and fill in billing details with a valid credit card. But don't worry: GCP provides $300 of free credit upon sign-up, which is more than enough for this experiment and many others like it. Here are the details of how to set up a GCP account: https://cloud.google.com/billing/docs/how-to/manage-billing-account

Experiment:

Let's get started by grabbing a prebuilt model from Google's Cloud ML samples and training and predicting with it. We will run the whole operation in Google Cloud Shell, switching to the console as needed to review the results of our shell commands. You can also choose to do this on your own laptop if you install the Google Cloud SDK. Here are the instructions for getting started with the Google Cloud SDK: https://cloud.google.com/sdk/docs/.

Note: this type of tutorial is best done through video, where all the actions can be seen, but I have chosen to write it up since I feel more comfortable writing than creating a video. So if video tutorials are your thing, head to YouTube, where there are numerous tutorials like this from the GCP team and others. I will post screenshots as much as possible to show the commands and results here.

Project:

In Google Cloud, resource management and billing are handled through a project, so the first thing you need to do is create a project. The project ID and project number are unique; you can choose the project name.
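If you prefer the command line, a project can also be created from Cloud Shell; the ID below is a placeholder, since project IDs must be globally unique:

$gcloud projects create my-example-project-id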

Training:

Once you log in to the Google Cloud console, open a Cloud Shell as shown here: on the console home page you will see a symbol like ">_"; click it and a shell window will open.

At the shell prompt, set your project so that all operations are done using your project.

$gcloud config set project [selected-project-id] #replace with your project ID

Here is my command and output:

$gcloud config set project skilled-circle-191220

Updated property [core/project].

Check whether any models already exist.

$gcloud ml-engine models list

My output shows models that I created earlier. If this is your first model, nothing will be listed.

Run this command to update the necessary components:

$gcloud components update

Install or upgrade TensorFlow:

$pip install --user --upgrade tensorflow

Check installations:

$python

import tensorflow as tf

hello = tf.constant("Hello Tensorflow!")

sess = tf.Session()

print(sess.run(hello))

#It should print "Hello Tensorflow!"

exit()

At this point, we are good to start. We need a model to train; simply download one of Google's Cloud ML sample models from GitHub by running this command in Cloud Shell.

$wget https://github.com/GoogleCloudPlatform/cloudml-samples/archive/master.zip

$unzip master.zip

Navigate to the estimator model directory under census.

$cd

$cd ~/cloudml-samples-master/census/estimator/

Create a local directory for the training and evaluation data.

$mkdir data

Copy the training and evaluation data from Google's public Cloud Storage bucket:

$gsutil -m cp gs://cloud-samples-data/ml-engine/census/data/* data/

Set the training and evaluation data path variables:

$TRAIN_DATA=$(pwd)/data/adult.data.csv

$EVAL_DATA=$(pwd)/data/adult.test.csv

Install the sample's requirements.txt to ensure we're using the same TensorFlow version as the sample:

$sudo pip install -r ~/cloudml-samples-master/census/requirements.txt

Specify an output directory

$MODEL_DIR=output

Make sure the output directory is empty:

$rm -rf $MODEL_DIR/*

Run local training using gcloud

$gcloud ml-engine local train \
    --module-name trainer.task \
    --package-path trainer/ \
    --job-dir $MODEL_DIR \
    -- \
    --train-files $TRAIN_DATA \
    --eval-files $EVAL_DATA \
    --train-steps 1000 \
    --eval-steps 100

Check model metrics using tensorboard, run

$tensorboard --logdir=$MODEL_DIR --port=8080

Open the web preview from the top right of the Cloud Shell window on port 8080.

It should open a browser tab on your local machine where the model metrics can be seen.

You can also find more information about the TensorFlow model from its graph.

Distributed Mode: In production, a model needs to be trained on scalable infrastructure where machines can be added as more compute resources are required. Cloud ML Engine supports running training in a distributed mode, where multiple ML Engine instances work together to complete the training job. To use distributed training, just add the --distributed flag to the ml-engine training command above, as follows:

$gcloud ml-engine local train \
    --module-name trainer.task \
    --package-path trainer/ \
    --job-dir $MODEL_DIR \
    --distributed \
    -- \
    --train-files $TRAIN_DATA \
    --eval-files $EVAL_DATA \
    --train-steps 1000 \
    --eval-steps 100

Deploy model for prediction:

Now we can test the model both online and in batch mode by passing a test.json file.
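Each line of test.json holds a single JSON instance with the census features the model expects, along these lines (an illustrative record; see the test.json shipped with the sample for the authoritative schema):

{"age": 25, "workclass": " Private", "education": " 11th", "education_num": 7, "marital_status": " Never-married", "occupation": " Machine-op-inspct", "relationship": " Own-child", "race": " Black", "gender": " Male", "capital_gain": 0, "capital_loss": 0, "hours_per_week": 40, "native_country": " United-States"}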

$MODEL_NAME=census
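The commands below also assume $REGION and $PROJECT_ID are set; for example (us-central1 is one supported region, and the second command simply reads back the project you configured earlier):

$REGION=us-central1

$PROJECT_ID=$(gcloud config list project --format "value(core.project)")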

Create an ml-engine model (we will add version v1 to it shortly):

$gcloud ml-engine models create $MODEL_NAME --regions=$REGION

Set a Cloud Storage path where the model output is stored:

$OUTPUT_PATH=gs://$PROJECT_ID/census_dist_1
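Note: this path assumes a cloud training job named census_dist_1 has already written its output to GCS. If you have only run local training so far, here is a minimal sketch of submitting such a job, assuming you first copied the training data into gs://$PROJECT_ID/data/:

$gcloud ml-engine jobs submit training census_dist_1 \
    --job-dir $OUTPUT_PATH \
    --runtime-version 1.4 \
    --module-name trainer.task \
    --package-path trainer/ \
    --region $REGION \
    --scale-tier STANDARD_1 \
    -- \
    --train-files gs://$PROJECT_ID/data/adult.data.csv \
    --eval-files gs://$PROJECT_ID/data/adult.test.csv \
    --train-steps 1000 \
    --eval-steps 100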

Look up the full path of the exported trained model binaries:

$gsutil ls -r $OUTPUT_PATH/export

Look for a directory named $OUTPUT_PATH/export/census/<timestamp> and copy the timestamp value (without colons) into the command below.

$MODEL_BINARIES=gs://$PROJECT_ID/census_dist_1/export/census/<timestamp>


Create the model version for prediction using the command below.

$gcloud ml-engine versions create v1 \
    --model $MODEL_NAME \
    --origin $MODEL_BINARIES \
    --runtime-version 1.4

After the model creation is complete, you can go to the console and navigate to ML Engine to see that the model version is there.

Run an online prediction against the test.json file:

$gcloud ml-engine predict \
    --model $MODEL_NAME \
    --version v1 \
    --json-instances ../test.json

You should see output like the following (the exact numbers may differ for you); the two classes correspond to the census income brackets:

CLASSES       PROBABILITIES
[u'0', u'1']  [0.9969545602798462, 0.0030454816296696663]

For batch prediction, ML Engine launches a job to complete the predictions, which is the typical pattern for production implementations.

$JOB_NAME=census_prediction_1

$OUTPUT_PATH=gs://$PROJECT_ID/$JOB_NAME
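Batch prediction reads its input from GCS, so $TEST_JSON must point at a copy of test.json in a bucket. For example (the destination path here is just a suggestion):

$gsutil cp ../test.json gs://$PROJECT_ID/data/test.json

$TEST_JSON=gs://$PROJECT_ID/data/test.json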

$gcloud ml-engine jobs submit prediction $JOB_NAME \
    --model $MODEL_NAME \
    --version v1 \
    --data-format TEXT \
    --region $REGION \
    --input-paths $TEST_JSON \
    --output-path $OUTPUT_PATH/predictions

Wait a few minutes, and then you can see the prediction output using:

$gsutil cat $OUTPUT_PATH/predictions/prediction.results-00000-of-00001

Or you can go to the console and navigate to your GCS bucket to check the output.

Output:

{"probabilities": [0.917137086391449, 0.0828629657626152], "logits": [-2.4040687084198], "classes": ["0"], "class_ids": [0], "logistic": [0.0828629583120346]}

Cleaning up:

Make sure to delete all the files you stored in your GCS bucket in order to avoid ongoing billing (in this walkthrough, the bucket name is your project ID):

$gsutil rm -r gs://$BUCKET_NAME/$JOB_NAME
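Optionally, you can also delete the model version and the model itself to keep the project tidy; the version must be deleted before the model that contains it:

$gcloud ml-engine versions delete v1 --model $MODEL_NAME

$gcloud ml-engine models delete $MODEL_NAME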

Recap: here is what we covered:

1. Trained a model in Cloud ML Engine.

2. Saved the model output in a GCS bucket.

3. Trained the model both on a single instance and in distributed mode.

4. Reviewed model metrics using TensorBoard.

5. Created a model version for prediction.

6. Used the model for both online and batch prediction.

7. Reviewed the prediction output.

Please let me know if you face any issues running these commands and experiments, and please clap if you like it. Thank you.
