Generative Machine Learning on the Cloud


In the last year we’ve witnessed rapid advancements in hardware capabilities, continued development of user-friendly machine learning libraries, and AI connectivity for the maker community. With this backdrop of increasingly user-friendly AI, I spent the summer working with Google’s Artists & Machine Intelligence (AMI) program on a cloud-based tool to make generative machine learning and synthetic image generation more accessible, especially to artists and designers. This post will explain some common generative model structures as well as pitfalls and resources for people interested in coding their own.

Before I get to the project, a little about me.

Me, hiking the Enchantments

My name is Emily Glanz. I graduated from the University of Iowa about a year ago with a B.S. in Electrical Engineering, and have been working on various Google teams as part of the Engineering Residency program. My experience with machine learning prior to AMI was centered around prediction and classification tasks while working on a hearing loss diagnostic tool in college.

The goal of my AMI project was to lower the barrier to entry for using Google’s Cloud ML infrastructure. I wanted to make it easier to train and use generative models for creative applications. Using the cloud gives users easy access to GPUs for training without needing to set up a workstation. A concise and easy-to-use TensorFlow example acts as a good starting point for modifications and customization. Check out the project on GitHub: GenerativeMLonCloud.

The end-to-end system design allows a user to provide a custom dataset of images to train a Variational Autoencoder Generative Adversarial Network (VAE-GAN) model on Cloud ML. From here, their model is deployed to the cloud, where they can either input an embedding to generate a synthetic image in the style of their dataset, or input an image to get its embedding vector. In addition, I created an App Engine web application to demonstrate using Cloud ML’s Python API to interact with the deployed model. The current tool focuses on generating images, but we hope to add examples in the future that deal with other inputs such as text or audio.

CNNs, VAEs, and GANs

To kick off this project, I took the route taken by many jumping into neural nets: a few days spent on TensorFlow tutorials, a couple of read-throughs of Chris Olah’s fantastic blog, and some quality time digging through the endless examples of generative neural nets on GitHub.

After the Convolutional Neural Net (CNN) MNIST tutorial by TensorFlow, I was ready to dive headfirst into generative image models. I looked first at Variational Autoencoders (VAEs), then Generative Adversarial Networks (GANs), and ended up combining the two into a VAE-GAN as the final image-to-image model.

The first step in developing the generative tool was generating handwritten numbers using a VAE, which I created from the CNN tutorial. I started with a VAE because this network has been one of the most popular approaches to generative imagery in the past couple of years. The MNIST dataset is commonly used because it is a standard benchmark for image-based neural networks:

Generated MNIST digits using a VAE

The MNIST dataset is a great grayscale image dataset to get started with; TensorFlow even provides a nicely formatted version of the set with its library.

A VAE is composed of an Encoder network and a Decoder network. The Encoder takes input images and encodes them into embedding vectors that live in a latent space. The latent space has a lower dimensionality than the input image, which is why it is sometimes referred to as the ‘bottleneck’ of the network. This bottleneck forces the Encoder to learn an information-rich compression of the raw input data as it maps the image to the latent space. For example, one of the features learned by the VAE could be the amount of ‘smile’ in a face. A constraint is added to the Encoder that pushes the network to create latent vectors that follow a unit Gaussian distribution. The Decoder learns to reconstruct the given input from these embeddings. The network is trained end-to-end: the Encoder learns the most important features of the input image, allowing the Decoder to reconstruct the input image from the latent vector representation. The model becomes generative once trained: a vector randomly sampled from the unit Gaussian (the distribution enforced in the Encoder) is passed into the Decoder, which generates a synthetic image from that point in the latent space.
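To make the unit Gaussian constraint concrete, here is a minimal NumPy sketch of the two pieces found in the standard VAE formulation: sampling a latent vector from the Encoder's predicted mean and log-variance, and the KL divergence term that pulls those predictions toward a unit Gaussian. The function names and shapes are illustrative assumptions, not the project's actual TensorFlow code.

```python
import numpy as np


def sample_latent(mean, log_var, rng):
    """Reparameterization trick: z = mean + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mean.shape)
    return mean + np.exp(0.5 * log_var) * eps


def kl_to_unit_gaussian(mean, log_var):
    """KL(N(mean, sigma^2) || N(0, I)), summed over latent dims, one value per example."""
    return -0.5 * np.sum(1.0 + log_var - mean ** 2 - np.exp(log_var), axis=1)


rng = np.random.default_rng(0)

# Hypothetical encoder outputs for a batch of 4 images and an 8-dimensional latent space.
mean = np.zeros((4, 8))
log_var = np.zeros((4, 8))

z = sample_latent(mean, log_var, rng)      # latent vectors handed to the Decoder
kl = kl_to_unit_gaussian(mean, log_var)    # zero here: the encoder output is already N(0, I)
```

In training, this KL term would be added to the reconstruction loss; at generation time the Encoder is skipped and `z` is drawn directly from the unit Gaussian.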

Image Generation

In the variant I used, I have a couple of convolutional layers in my encoder and a couple of transposed convolutional layers (aka “deconvolution”) in my decoder (I’ll get more into the detailed architecture of the VAE later). I found this tutorial to be a great explanation of VAEs.

At this point I took a detour into Conditional VAE (CVAE) land, and used the MNIST dataset to play with this autoencoder variation, which conditions on label information so the user can specify which number they would like to generate. The CVAE is trained by appending the one-hot encoded vector representing the label of the input image (so if the input image is a 9, the label vector is [0,0,0,0,0,0,0,0,0,1]) to both the input image and the latent space vector. Then, to request a specific generated number, the user can input a random embedding sampled from the unit Gaussian distribution combined with the one-hot encoded vector of the desired number.
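As a rough illustration of the conditioning step at generation time, the sketch below builds the one-hot label vector and concatenates it with a sampled embedding to form the decoder input. The latent size of 20 and the helper names are assumptions made for the example; the post does not state the sizes used.

```python
import numpy as np


def one_hot(label, num_classes=10):
    """Return a one-hot vector with a 1 at index `label`."""
    vec = np.zeros(num_classes, dtype=np.float32)
    vec[label] = 1.0
    return vec


def conditioned_decoder_input(z, label_vec):
    """Concatenate the latent sample with the label vector."""
    return np.concatenate([z, label_vec])


rng = np.random.default_rng(0)
latent_dim = 20  # hypothetical latent size

z = rng.standard_normal(latent_dim).astype(np.float32)  # sample from the unit Gaussian
nine = one_hot(9)                                        # request the digit 9
decoder_input = conditioned_decoder_input(z, nine)       # length latent_dim + 10
```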

Generated MNIST digits using a CVAE

One fun thing that can be done with a CVAE is to mix together two labels (in this case two numbers). Usually, we’re only supposed to set one of the bits of the one-hot vector high, but what happens if we set two bits high? What if we asked the CVAE to generate an image with the condition [0,0,1,0,0,1,0,0,0,0]? This is essentially asking the CVAE to generate an image with both labels 2 and 5. In this case, the decoder tries to generate an image that matches this condition, resulting in an image that looks like a 2 combined with a 5. Beyond digits, this feature of CVAEs could be used to combine images of different labels. One such application is generating synthetic faces that match specific attributes, like ‘female, brunette, etc’. This chart shows what happens when each number (0 through 9) is combined with 2:

Each digit combined with 2

For the above image, the condition requested is the one-hot representation of the desired number (0 through 9) OR’ed with the one-hot representation of 2. For example, to get a 7 combined with a 2, the condition vector is [0,0,1,0,0,0,0,1,0,0], which is then concatenated with a random sample from the unit Gaussian!
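The OR'ing of labels described here can be written in a few lines of NumPy. This is only a sketch of how the condition vectors for the chart could be built; `multi_hot` is a hypothetical helper name.

```python
import numpy as np


def multi_hot(labels, num_classes=10):
    """Return a 0/1 vector with a 1 at every index listed in `labels`."""
    vec = np.zeros(num_classes, dtype=np.int64)
    vec[list(labels)] = 1
    return vec


two = multi_hot([2])
seven = multi_hot([7])

# Bitwise OR of the two one-hot vectors sets both bits high.
seven_and_two = np.bitwise_or(seven, two)

# One condition per digit 0..9, each OR'ed with the label 2.
chart_conditions = [np.bitwise_or(multi_hot([d]), two) for d in range(10)]
```

Note that the entry for digit 2 stays a plain one-hot vector, since OR'ing a label with itself changes nothing.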

Next, it was time to add a GAN (Generative Adversarial Network) loss onto the end of the VAE to sharpen the generated output. VAEs tend to make images blurry because of the way the network is penalized while training. For VAEs, the reconstruction cost (typically a mean-squared-error (L2) loss) heavily penalizes edges and features that are even slightly moved with respect to the input image, so a blurry average over plausible positions ends up cheaper than a sharp guess that is slightly off. Adding a GAN, which uses the adversarial Generator vs. Discriminator loss described below, sharpens the output because this loss does not demand an exact pixel-wise reconstruction and focuses on the realism of the image features instead.
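A tiny numerical example shows why an L2 loss favors blur. The toy 8x8 images below are made up for illustration: a sharp vertical edge, the same edge moved one pixel to the right, and a blurred version that sits halfway between the two.

```python
import numpy as np


def mse(a, b):
    """Mean squared error between two images of the same shape."""
    return float(np.mean((a - b) ** 2))


# Target: left half black, right half white (edge at column 4).
target = np.zeros((8, 8))
target[:, 4:] = 1.0

# Sharp reconstruction whose edge landed one pixel too far right.
shifted = np.zeros((8, 8))
shifted[:, 5:] = 1.0

# Blurry reconstruction: the uncertain column is set to gray.
blurred = target.copy()
blurred[:, 4] = 0.5

sharp_but_shifted_error = mse(target, shifted)  # 0.125
blurry_error = mse(target, blurred)             # 0.03125
```

The blurry image scores four times better than the sharp but slightly misplaced one, which is the behavior the adversarial loss is meant to counteract.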

A GAN is trained using adversarial learning. A GAN is composed of two networks, a Generator and a Discriminator. The Discriminator’s goal is to correctly distinguish between “real” and “fake” input (in this case, real MNIST images versus generated MNIST images). The Generator’s goal is to produce output that fools the Discriminator. These two networks play a game to see who can beat whom. In the case of adding a GAN loss to the VAE, the VAE’s Decoder serves as the Generator, and all we need to do is add a Discriminator network. Check out this blog for a more thorough rundown on GANs.
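The two-player game can be expressed as a pair of loss functions on the Discriminator's outputs. The sketch below uses the common cross-entropy formulation with the non-saturating Generator loss; this is an assumption about the loss form, since the post does not list the exact losses used in the project.

```python
import numpy as np


def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and 0/1 targets."""
    probs = np.clip(probs, eps, 1.0 - eps)
    return float(-np.mean(targets * np.log(probs) + (1.0 - targets) * np.log(1.0 - probs)))


def discriminator_loss(d_real, d_fake):
    """The Discriminator wants real images scored as 1 and generated images as 0."""
    return (binary_cross_entropy(d_real, np.ones_like(d_real))
            + binary_cross_entropy(d_fake, np.zeros_like(d_fake)))


def generator_loss(d_fake):
    """The Generator wants its images scored as real (1) by the Discriminator."""
    return binary_cross_entropy(d_fake, np.ones_like(d_fake))


# Hypothetical Discriminator scores (probability that the input is real).
fooled = np.array([0.9, 0.9])  # Discriminator believes the fakes
caught = np.array([0.1, 0.1])  # Discriminator spots the fakes
```

The Generator's loss is low when the Discriminator is fooled and high when the fakes are caught, so improving one network's loss tends to worsen the other's.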

Less fuzzy output from the VAE-GAN!
MNIST digits generated by the VAE
MNIST digits generated by the VAE-GAN

From this point, I started developing a model for RGB images. I took the VAE-GAN architecture I had used with the MNIST digits, and beefed it up with inspiration from this DCGAN, this training technique, and this VAE-GAN on GitHub. Most of the layer architecture for this VAE-GAN originated from the DCGAN link.

A very simplified view of the network looks like this:

The Encoder Network:

The Decoder / Generator Network:

The Discriminator Network:

Batch normalization was used in each of the networks. While training the VAE-GAN I encountered all the woes associated with GAN training, including mode collapse, exploding gradients, and generated noise.

Some of the generated faces

Another way to explore the embedding space is using spherical linear interpolation, aka slerp. This technique, introduced in this paper and applied to VAEs and GANs in this paper, allows me to traverse the space between two known embedding vectors. For example, I can take an image of a smiling woman and an image of a man who is not smiling, and explore the transformation between the two in the latent space.

Demonstration of exploring the latent space with ‘slerping’

The top images are the reconstructions of the first input image, and the bottom images are the reconstructions of the second input image, with the images in between showing the transition. The second column from the right shows the result of using a non-face image, a picture of a crow, in the mix. The rightmost column shows the result of two non-face images (a dog and a cat). The autoencoder has trouble reconstructing the images of animals because the latent space has been tailored to faces, specifically those of the CelebA dataset.

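The equation for slerp:

$$\mathrm{slerp}(q_1, q_2; \mu) = \frac{\sin((1-\mu)\,\theta)}{\sin\theta}\, q_1 + \frac{\sin(\mu\,\theta)}{\sin\theta}\, q_2$$

Where q1 and q2 are the embedding vectors produced by the encoder from the two input images, θ is the angle between the two vectors, and the parameter μ (mu) varies from 0 to 1.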
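For reference, here is a small NumPy sketch of slerp in its standard form, using the q1, q2, and mu named in the text. The fallback to plain linear interpolation for (anti)parallel vectors is an added safeguard against dividing by zero, not something described in the post.

```python
import numpy as np


def slerp(q1, q2, mu):
    """Spherical linear interpolation between vectors q1 and q2 for mu in [0, 1]."""
    q1 = np.asarray(q1, dtype=np.float64)
    q2 = np.asarray(q2, dtype=np.float64)
    # Angle between the two vectors, computed from their normalized dot product.
    cos_theta = np.dot(q1, q2) / (np.linalg.norm(q1) * np.linalg.norm(q2))
    theta = np.arccos(np.clip(cos_theta, -1.0, 1.0))
    if np.isclose(np.sin(theta), 0.0):
        # Parallel or anti-parallel vectors: fall back to linear interpolation.
        return (1.0 - mu) * q1 + mu * q2
    return (np.sin((1.0 - mu) * theta) * q1 + np.sin(mu * theta) * q2) / np.sin(theta)


# Walk from one embedding to another in 5 steps (toy 2-D embeddings).
start = np.array([1.0, 0.0])
end = np.array([0.0, 1.0])
path = [slerp(start, end, mu) for mu in np.linspace(0.0, 1.0, 5)]
```

Each vector in `path` would be passed through the Decoder to render one frame of the transition between the two input images.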

Interesting Training Difficulties

Training generative networks can be tricky and it’s worth recognizing some of the common ailments and their remedies, so let’s detour into the previously mentioned woes.

Case: Generator collapsed and produces a single example (mode collapse)

This case shows that a generator does not always converge to the data distribution.

Here you can see the generator converging to a single example

At first it appears that the VAE-GAN is starting to learn different features (like hair, face orientation, almost sunglasses at one point) but then we see the generator (in this case the decoder of the VAE portion) breaks down and only produces one example. Playing around with learning rates for the networks and batch normalization solved this problem in my specific case.

Case: Generator too strong, exploiting non-meaningful weakness of discriminator (loss / gradients exploded)

The consequences of a generator not trained properly

At first, the generator produces only solid-color images, and it never gets to images that are even close to face-like. Training the generator and discriminator based on loss thresholds kept one network from getting too much stronger than the other for me.
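One way threshold-based training could look is sketched below: before each step, decide which networks get a weight update based on their most recent losses. The threshold values and the exact rule are hypothetical placeholders; the post does not state which thresholds or conditions were used.

```python
def networks_to_update(d_loss, g_loss, d_min=0.3, g_min=0.3):
    """Return (train_discriminator, train_generator) for the next training step.

    A network whose loss is already below its threshold is considered
    "too strong" and skips its update so the other network can catch up.
    The thresholds d_min and g_min are illustrative assumptions.
    """
    train_d = d_loss > d_min
    train_g = g_loss > g_min
    if not train_d and not train_g:
        # Both losses are low: keep training both rather than stalling.
        return True, True
    return train_d, train_g


# Discriminator far ahead: pause it and train only the generator.
decision = networks_to_update(d_loss=0.1, g_loss=2.0)
```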

Case: Learning rate too high for VAE over discriminator network

A learning rate too high

After I altered only the learning rate of the VAE, the network started to generate noisy faces. Experimenting with learning rates for the different optimizers was tricky.

Case: Choosing the correct parameters / network architecture

Choosing the appropriate embedding size, number of training steps, etc. is crucial to getting realistic output from a GAN. I found this GitHub repository to have some awesome tips for GAN training.

An old version of the model:

Embedding Dimension: 2048, Training Steps: 20000, Batch Size: 10, ReLU activations, sigmoid as final activation in the Decoder/Generator, no batch normalization

Current model:

Embedding Dimension: 100, Training Steps: 80000, Batch Size: 64, Leaky ReLU in Discriminator, batch normalization in Encoder, Decoder/Generator, and Discriminator, tanh as final activation in Decoder/Generator (so image values in [-1,1] instead of [0,1])
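
Because a tanh output layer produces values in [-1,1], training images need to be scaled to the same range, and generated images need to be mapped back before they can be saved or displayed. A minimal sketch of both conversions, assuming 8-bit input images (the helper names are made up for this example):

```python
import numpy as np


def to_tanh_range(pixels_uint8):
    """Map 8-bit pixel values [0, 255] to [-1, 1] to match a tanh output layer."""
    return pixels_uint8.astype(np.float32) / 127.5 - 1.0


def from_tanh_range(images):
    """Map generator outputs in [-1, 1] back to 8-bit pixel values [0, 255]."""
    return np.clip(np.rint((images + 1.0) * 127.5), 0, 255).astype(np.uint8)


pixels = np.arange(256, dtype=np.uint8)
scaled = to_tanh_range(pixels)       # spans -1.0 to 1.0
restored = from_tanh_range(scaled)   # back to the original 0..255 values
```
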
Further sources of information for GAN training difficulties:

Using Cloud ML

Once I had a generative model for images, it was time to really solidify the end-to-end system, the main goal of this project.

The dream is to allow a user (with a directory of images) to train their very own VAE-GAN model on their very own image dataset.

System Design

System for Generative ML on the Cloud

Here is an overview of the steps required to make the user’s generative model dreams a reality:

  1. Preprocessing: a directory of images (either JPEG or PNG) is converted into TFRecords and split into evaluation and test datasets. These are stored in the user’s Cloud Storage bucket.
  2. Training Job: a training job kicks off training of the VAE-GAN model using the user’s TFRecords (on GCS) as input. The TensorFlow VAE-GAN code is packaged and uploaded to the user’s GCS bucket. The model is trained using Cloud ML Engine (GPUs/CPUs/RAM specified in config file) with the checkpoints and final SavedModel being saved to GCS.
  3. Create and Deploy Model: a model is created and the SavedModel code is then deployed onto the Cloud ML Engine.
  4. Prediction (Generation) Jobs: the prediction API is used to access the trained model hosted on Cloud ML. In one mode, embeddings are sent as input and a synthetic image is returned as output. In the second mode, an input image is supplied and an embedding is returned as output. This job can be run from the command line using the Cloud SDK or from the Python client library. I used an App Engine project to provide a sample interface for the user to generate images from two trained models.
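As a sketch of step 4 from the Python side, the snippet below builds an online prediction request for the embedding-to-image mode and sends it with the Google API Python client. The `"instances"` wrapper and the `projects/<project>/models/<model>` resource name follow the Cloud ML Engine prediction API; the input key `"embeddings"`, the project and model names, and the helper functions are assumptions, since the real key depends on how the SavedModel's serving signature was exported.

```python
def model_resource_name(project, model, version=None):
    """Build the resource name expected by the Cloud ML Engine predict call."""
    name = "projects/{}/models/{}".format(project, model)
    if version is not None:
        name += "/versions/{}".format(version)
    return name


def embedding_request_body(embeddings, input_key="embeddings"):
    """Wrap embedding vectors in the {"instances": [...]} request format.

    `input_key` is a hypothetical name; it must match the deployed model's input.
    """
    return {"instances": [{input_key: [float(x) for x in e]} for e in embeddings]}


def predict(project, model, body, version=None):
    """Send the request to the deployed model (needs credentials and network access)."""
    # Imported here so the helpers above work without the client library installed.
    from googleapiclient import discovery

    service = discovery.build("ml", "v1")
    name = model_resource_name(project, model, version)
    return service.projects().predict(name=name, body=body).execute()


# Hypothetical project and model names, with a toy 3-dimensional embedding.
name = model_resource_name("my-project", "vae_gan_faces")
body = embedding_request_body([[0.0, 1.0, -0.5]])
```

Calling `predict("my-project", "vae_gan_faces", body)` would then return the model's response as a dictionary, provided the model is deployed and application credentials are configured.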

System Setup

To get the tool up and running on Cloud ML, the Cloud environment first has to be set up. A Cloud Platform project has to be created on the projects page, billing has to be set up, and then the Cloud ML Engine and Compute Engine APIs have to be enabled. To use the command line interface, the Cloud SDK must be installed. Follow these instructions to set up the cloud environment.

Running the System

From here, the user can begin running training jobs. I created a script that allows the user to specify an image directory and then takes care of preprocessing the images and starting the training job on Cloud ML. Other flags allow the user to further tune their training and preprocessing tasks, such as center-cropping the images or choosing which port to start their TensorBoard instance on (TensorBoard: the greatest way to monitor any TensorFlow training).
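For readers unfamiliar with center-cropping, this is roughly what that preprocessing option does to each image. The sketch below assumes a square crop on a height x width x channels array; the project's script may crop differently.

```python
import numpy as np


def center_crop(image, size):
    """Return the central size x size region of an image shaped (H, W, ...)."""
    height, width = image.shape[:2]
    if size > min(height, width):
        raise ValueError("crop size is larger than the image")
    top = (height - size) // 2
    left = (width - size) // 2
    return image[top:top + size, left:left + size]


# Toy 6x8 RGB image cropped to its central 4x4 region.
image = np.arange(6 * 8 * 3).reshape(6, 8, 3)
cropped = center_crop(image, 4)
```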

A screen grab of the TensorBoard instance during training

Another script I created allows users to create and deploy the models produced by their training jobs on Cloud ML. Once a model is on Cloud ML, getting generated images or image embeddings is one API call away.

End Notes

Playing around with VAEs and GANs let me generate some fun images:

Beyond faces…

The MNIST dataset and CelebA dataset are great datasets to test and develop a network — but what else could one use to autoencode and generate?

Here are some of my favorite generative art projects for inspiration:

8 bit art by Adam Geitgey
Cats by Alexia Jolicoeur-Martineau
GANGogh by Kenny Jones and Derrick Bonafilia
Fake Kanji Experiment by David Ha


This work was supported by Google’s Engineering Residency program, on my rotation with Artists and Machine Intelligence. I’d like to thank Larry Lindsey and Mike Tyka for guiding me in generative machine learning and TensorFlow, as well as the entirety of AMI for answering any questions I had and giving me fantastic insight into the world of AI. Huge shoutout to Jac de Haan and Kenric McDowell for all the support for the project as well.