Hello World on GCP ML Engine

Coding from Zero to Web Request

Published in

Google Cloud - Community

13 min readJun 16, 2018

While working on my MushroomBot project, I found that there wasn’t much documentation on how to use Google Cloud Platform ML Engine. I think this is because TensorFlow can do so many things and most examples are ~500 lines long that are very hand-wavy. To prove the value of something, you typically start with the simplest viable example. In this post, I’ll try a similar strategy to focus on building blocks and not on a fancy TensorFlow program.

What is ML Engine?

From my understanding, ML Engine is a set of tools to help you train and deploy machines learning models on Google Cloud Platform. If the platform is as easy as advertised, we should end up validating the following statement.

Focus on models, not operations

I think the best use case is the idea of deploying a machine learning model. This makes machine learning a much more develop-friendly process especially if the model isn’t the interesting part. The lifecycle of our algorithms is what we will focus on in this article.

Where do you start?

First things first, we need to find some good documentation.

“Developing a training application is a complex process that is largely outside of the scope of this document”

I don’t think that developing a training application is out of the scope of this article. I think developing an efficient or useful algorithm is out of the scope of this article. I understand that real TensorFlow application require a deep understanding of how the framework works, but at least give us something to start with. Rather than complain, we can (ironically) follow the rest of this page on how to proceed.

Starting with the samples directory, we can take a look at the Census example. Here, we see a total of ~800 lines of code and three files (blah). The first thing to do is try and understand this pattern. Why are there three files? What makes each file important?

Project Structure

Nestled away in the documentation are a few paragraphs about project structure. The picture on that page is immensely helpful and what we will start with.

$ tree .
.
├── setup.py
└── trainer
    ├── __init__.py
    ├── model.py
    ├── task.py
    └── util.py

I have copied their example setup.py file and have added a Makefile to construct a single source of truth for all useful commands. Here is the useful snippet of what each of these files means.

task.py contains the trainer logic that manages the job.
model.py contains the TensorFlow graph code—the logic of the model.
util.py if present, contains code to run the trainer.

While these descriptions are helpful, I am still not understanding the inputs and outputs of each file. Do these take command line parameters? Are they invoked directly or by another modules?

You can find some commands on how to run a trainer locally here. Adding this line to my Makefile, we have the following contents:

MODEL_DIR=./output
TRAIN_DATA=./train
EVAL_DATA=./evaltrain_local:
 gcloud ml-engine local train \
    --module-name trainer.task \
    --package-path trainer/ \
    --job-dir ${MODEL_DIR} \
    -- \
    --train-files ${TRAIN_DATA} \
    --eval-files ${EVAL_DATA} \
    --train-steps 1000 \
    --eval-steps 100

and the following directory structure:

$ tree .
.
├── Makefile
├── eval
├── output
├── setup.py
├── train
└── trainer
    ├── __init__.py
    ├── model.py
    ├── task.py
    └── util.py

The format of the files inside of eval and train will likely depend on out algorithm so it definitely feels like we need a working example soon.

Running the trian_local command, we get the following:

$ make train_local
gcloud ml-engine local train \
    --module-name trainer.task \
    --package-path trainer/ \
    --job-dir ./output \
    -- \
    --train-files ./train \
    --eval-files ./eval \
    --train-steps 1000 \
    --eval-steps 100

After, we can see that something happened because one of our files has been compiled:

$ find .
.
./eval
./Makefile
./output
./setup.py
./train
./trainer
./trainer/__init__.py
./trainer/__init__.pyc <----------------------
./trainer/model.py
./trainer/task.py
./trainer/util.py

However, it is pretty clean that we can put anything into this training step and the local trainer doesn’t seem to care if anything gets done.

(almost) Writing Some Code

The same guide does illustrate a barebones TensorFlow Program. However, the guide doesn’t tell us where to write code or which file should do what. The rest of the guide seems to do a bunch of cool stuff in the cloud but us plebes are still stuck in square one.

As we are trying to get to a model to serve a web request, perhaps we can work backward from there. The Deployment Model documentation has a pretty clear description of what it wants: SavedModel. Great! Given that the WebRequest needs to invoke our TensorFlow program, perhaps we need to structure these files to take a very specific data object.

Most of these documents reference a “serving function” which aligns with our intuition but doesn’t help with the contracts. Based on this documentation, there are recommendations on what to name the functions but nothing appears to be required.

While not referenced anywhere, there does exist a cloudml-template directory in the CloudML samples repository. We will focus on the /template directory as this contains the Python package structure. We are replacing the /scripts directory with our Makefile (and can likely copy a few example over from there).

In task.py, we can finally see the “start” of a Python program.

if __name__ == '__main__':
    main()

Looking through the main function, we can see that this module can be invoked multiple times.

# If job_dir_reuse is False then remove the job_dir if it exists
print("Resume training:", HYPER_PARAMS.reuse_job_dir)

Investigating further, we can find a RunConfig function which appears to

run_config = tf.estimator.RunConfig(
  tf_random_seed=19830610,
  log_step_count_steps=1000,
  save_checkpoints_secs=120,
  keep_checkpoint_max=3,
  model_dir=model_dir )

This function takes our model_dir which was defined earlier as model_dir = HYPER_PARAMS.job_dir. Confusingly, we have set job_dir equal to MODEL_DIR. Gah!

After the run_config is created, it is passed to another function to run the “experiment.” This other function, run_experiment, is another function we need to maintain in our repository (though I presume will follow a pattern across all ML Engine users). The run_experiment function ties together a few concepts. It starts with the TensorFlow data/dataset package, structures some of models using the estimator package, and sums up the variables using:

# train and evaluate
tf.estimator.train_and_evaluate(
  estimator,
  train_spec,
  eval_spec )

To summarize:

input.py should utilize the tf.data package to deal with turning your complex data source into something TensorFlow can understand
task.py should use functions from the tf.estimator package that deals with running your model. Think of functions that are independent of the type of model that is used.
model.py should return classes from the tf.estimator package. Think of something like LinearClassifier .

But what happened to the “serving function” we were trying to work back from earlier? Where does this get generated?

Because the serving function is seen as type of input, there is a json_serving_input_fn defined in input.py. This function is passed to FinalExporter in the run_experiment function mentioned above.

exporter = tf.estimator.FinalExporter(
  'estimator',
  input.SERVING_FUNCTIONS[HYPER_PARAMS.export_format],
  as_text=False )

This exporter is passed to a tf.estimator.EvalSpec which is later passed to the final train_and_evaluate function at the end of the run_experiment function.

To summarize, if you want to understand how to structure your TensorFlow package, I recommend working backward from the run_experiment function and seeing how the “tree” of dependencies works from there.

Working Backward-ish

Even though we know input.py is really the “start” of our data, let focus on the backbone of our training process (the main and train_experiment functions).

While somewhat silly, I am going to start with the following model.py :

import tensorflow as tfdef main():
    tf.estimator.train_and_evaluate(
        estimator,
        train_spec,
        eval_spec
    )if __name__ == '__main__':
    main()

After running, make test_local we can see that we are missing the TensorFlow package. I have updated the REQUIRED_PACKAGES in my setup.py to reflect the one defined in the cloudml-samples repository. This didn’t change any behavior because the ml-engine local train task doesn’t install your packages. I don’t really feel like digging through their source code but I would imagine that this command is basically just running Python somehow (hinting that we might have to set up a virtualenv or similar).

After some hacking around, I have come up with the following modified Makefile:

VIRTUALENV_DIR=./env
PIP=${VIRTUALENV_DIR}/bin/pip
ACTIVATE=source ${VIRTUALENV_DIR}/bin/activate# Python + Environmentvirtualenv:
  virtualenv ${VIRTUALENV_DIR}install: virtualenv
  ${PIP} install --editable .# TensorFlowMODEL_DIR=./output
TRAIN_DATA=./train
EVAL_DATA=./evalTRAINER_PACKAGE=trainer
TRAINER_MAIN=${TRAINER_PACKAGE}.tasktrain_local:
  bash -c '${ACTIVATE} && gcloud ml-engine local train \
    --module-name ${TRAINER_MAIN} \
    --package-path ${TRAINER_PACKAGE} \
    --job-dir ${MODEL_DIR} \
    -- \
    --train-files ${TRAIN_DATA} \
    --eval-files ${EVAL_DATA} \
    --train-steps 1000 \
    --eval-steps 100'

Unfortunately, there isn’t a good way to use virtualenv and Makefiles so I have wrapped the gcloud ml-engine command inside of a bash -c '' so that we can utilize the current environment’s Python environment instead of relying on our user’s installed packages. I would imagine we could go as far as to use some Docker hacks to get a standard environment across all development machines which might move the complexity of virtualenv/python to something standard across host operating systems.

By using pip install --editable ., we can install the current package in “edit” mode and centralize our dependencies in the setup.py file instead of requirements.txt that is occasionally mentioned in the Getting Started guide. I find any reference to a requirements.txt confusing because the recommended package structure is to use setup.py yet the Getting Started contains no references to installing our package as editable.

After updating the Makefile, train_local results in NameError: global name estimator is not defined which is what we were expecting.

Estimators, Optimizers, Operations

Now, you should be set to copy over parts of the cloudml-sample program. There is quite a bit of code at this point so I’ll cover the final result with a graph later highlighting the components.

I tried creating my own Estimator and found out the following:

To create an Estimator, you need and EstimatorSpec. This EstimatorSpec needs a training_op (think part of a TensorFlow Session graph). This training_op will performs some mathematical calculation and update the global_step variable (returned from tf.train.get_global_step()). Typically, a training_op is generated by calling the minimize function pre-created Optimizers like tf.train.GradientDescentOptimizer. While creating a new Optimizer might be as easy as creating a subclass, you’ll still run into issues with Operations.
Creating a new Operation in TensorFlow requires compiling C++ code and creating a Python wrapper. This means we can’t easily create some bogus training_op when we are trying to create a “model” that doesn’t require training.
Both tf.estimator.TrainSpec and tf.estimator.EvalSpec need their respective step parameters greater than zero.

If you want to create an Estimator, you’ll need a graph with gradients. I don’t believe this means you need a graph with gradients to deploy to ML Engine, but you probably won’t be able to use the gcloud ml-engine local train command at all.

Building a Minimal Graph

What is the simplest graph that can be built and used with a pre-created Optimizer?

I think the MNIST example is a good place to start. We can being removing layers and adjusting numbers until we end up with something smaller.

After a long investigation, I have come up with the following graph:

input_layer = tf.constant([[1.0]])
dense = tf.layers.dense(inputs=input_layer, units=1,
  activation=tf.nn.relu)
dropout = tf.layers.dropout(inputs=dense, rate=0.4,
  training=mode == tf.estimator.ModeKeys.TRAIN)
logits = tf.layers.dense(inputs=dropout, units=2)

I have no idea what this graph does, but it seemed to be enough to compile without gradient warnings.

Exporting the Graph

As soon as you rid yourself of the gradient warnings, you might find yourself in a place to export the graph. There are two things that you’ll need to export the graph: config and export_outputs.

If you end up passing a FinalExporter to your EvalSpec, you’ll be able to export your function when your model is run with the Eval mode. The FinalExporter depends on function that must return a ServingInputReceiver. While lacking an solid example, there is some information located in the documentation.

Even with an exporter, you won’t actually save a graph if you don’t pass configuration containing a model_dir to your Estimator.

Serving the Exported Graph

First, you’ll need a bucket to put your training and evaluation data. Our data is randomly generated so we don’t need to worry about this step. However, you will need a bucket for the following:

gcloud CLI to upload your package
ML Engine to output the finished model

To submit our job, we need to add the following to the Makefile.

JOB_NAME=${BUCKET_NAME}_2
BUCKET_JOB_DIR=gs://${BUCKET_NAME}/${JOB_NAME}
REGION=us-central1
RUNTIME_VERSION=1.5train_job:
 gcloud ml-engine jobs submit training ${JOB_NAME} \
    --job-dir ${BUCKET_JOB_DIR} \
    --runtime-version ${RUNTIME_VERSION} \
    --module-name ${TRAINER_MAIN} \
    --package-path ${TRAINER_PACKAGE} \
    --region ${REGION} \
    -- \
    --train-files ${TRAIN_DATA} \
    --eval-files ${EVAL_DATA}

After submitting the job to ML Engine, you should see your job finish in ~5 minutes.

I did a quick analysis of the cost of a training job and I think it is close to 2 cents per job minimum. You can read through the pricing page to find out more information.

After running the job, you can find your module uploaded in your bucket under an interestingly constructed path:

gs://${BUCKET}/${JOB_NAME}/packages/3c8688e2846d9cda3337beb86f8cb0926fc7ecd136180758f70ea06fe25898b8

In this directory, you’ll find your Python module. Downloading the module, we can see the following:

$ tree hello-world-0.1
hello-world-0.1
├── PKG-INFO
├── setup.cfg
├── setup.py
└── trainer
    ├── __init__.py
    ├── input.py
    ├── model.py
    ├── task.py
    └── util.py1 directory, 8 files

So where is the exported model? Nowhere to be found yet…

Instead, I decided to move forward and see how bad things could get. First, I tried creating a model from this directory (gcloud ml-engine models create and gcloud ml-engine versions create). If you try submitting the bucket containing the python files, you will get the following:

ERROR: (gcloud.ml-engine.versions.create)
FAILED_PRECONDITION:
Field: version.deployment_uri
Error: Deployment directory gs://XYZ/ is expected to contain exactly one of:
  [saved_model.pb, saved_model.pbtxt].

After running the ML Engine trainer locally, I confirmed that I was exporting everything in the output directory and was in fact creating a saved_model.pb file ( export/${FINAL_EXPORTER_NAME}/${TIMESTAMP}/saved_model.pb).

output
├── checkpoint
├── eval_estimator-eval
│   └── events.out.tfevents.1528682282.genslerj.machine
├── events.out.tfevents.1528682281.genslerj.machine
├── export
│   └── estimator
│       └── 1528682283
│           ├── saved_model.pb
│           └── variables
│               ├── variables.data-00000-of-00001
│               └── variables.index
├── graph.pbtxt
├── model.ckpt-1.data-00000-of-00001
├── model.ckpt-1.index
├── model.ckpt-1.meta
├── model.ckpt-100.data-00000-of-00001
├── model.ckpt-100.index
└── model.ckpt-100.meta5 directories, 13 files

With this knowledge, I went back to inspect the cloudml-sample codebase and the gcloud CLI tool. Now, we have been ignoring the command line arguments for quite some time. Finally, we come to the one argument that is required for our model, the --job-dir argument.

--job-dir=JOB_DIR
  A Google Cloud Storage path in which to store training outputs and other data needed for training.  This path will be passed to your TensorFlow program as --job_dir command-line arg. The benefit of specifying this field is that Cloud ML Engine will validate the path for use in training.  If packages must be uploaded and --staging-bucket is not provided, this path will be used instead.

And just like that!

After creating the hosted model version, we can call ml-engine predict to call out API. The JSON input will correspond to your JSON serving function. Here is my serving function:

def json_serving_function():
    inputs = {
        'one': tf.constant([1]),
        'zero': tf.constant([0])
    }return tf.estimator.export.ServingInputReceiver(
        features={
            'one': tf.constant([1]),
            'zero': tf.constant([0])
        },
        receiver_tensors=inputs,
        # receiver_tensors_alternatives=inputs
    )

Which corresponds to the following input examples:

{"one": 42, "zero": 100}
{"one": 0, "zero": 100}
{"one": 100, "zero": 0}

Testing the API, we get the following:

$ make test_model_version
gcloud ml-engine predict \
   --model helloworld_model \
   --version v1 \
   --json-instances ./json_instances.jsonl
OUTPUT
1.0

Now, I am not really sure anything is happening but I think I have detailed enough to help start experimenting (like matching receiver_tensors to the input to your model).

TensorFlow “Knowledge” Dependency Graph

At last, we reach the part where we can piece together all of the concepts into a somewhat manageable graph. I think this would serve a good way to describe how to use ML Engine from start to finish (and a few less hacks than this article).

Note that I am missing the train and eval directories. I think these might be optional because any dependency on the tf.data.Dataset might be able to pull data at runtime (not sure if that is a good idea, but theoretically possible).

If I had to recommend a path to using ML Engine, it would probably be the following:

Start with model.py. Most of those concepts are core to the rest of the wrapping code. Because you’ll need a gradient operation, you’ll want to develop intuition there for your use case.
Move on to input.py. Start with the Dataset package and just sift around there. You could build a substantial amount of code to read in your data so take this part seriously. You’ll also see you how your API will be hosted (serving function). This might point out how your want to structure data into your network and you may end up modifing code in model.py. Although you won’t be able to connect this work to the first part just yet, you might end up figuring out some “integration” issues.
Finally, you can piece it all together with task.py. While not required, I think you’ll end up fiddling with the train and eval step parameters.
Go all out with Hyperparameter Tuning and Distributed Training.
…
Profit!

Wrapping Up

Overall, this whole experience was a bit frustrating but I am very excited for patterns to emerge. I think GCP ML Engine is surprisingly accessible and flexible given that I was able to figure this out and I am not a data scientist or graduate/doctoral student. Stay tuned for more information on my specific use cases and how they end up working. Cheers and thank you for reading!

jgensler8/gcp-minimal-ml-engine-project

Contribute to gcp-minimal-ml-engine-project development by creating an account on GitHub.

github.com