How I Taught A Machine To Take My Job

Sam Snider-Held
Apr 13, 2018


or Behavioral Cloning and 3D Procedural Content Generation

A simple landscape created entirely by an ML algorithm, taught by me.

In my last Medium post, I discussed how we could use convolutional neural networks for gesture recognition in VR.

I concluded that while it was really cool, drawing objects was sometimes more tedious than having a simple menu. So that got me thinking…

What if I used neural networks to anticipate what objects I wanted to place?

Think about it like a surgeon’s assistant, getting the right tool at the right time, without being asked for it. It seemed like a logical next step.

But that’s when things started to get really trippy …

By training such a network, I have effectively imparted a fraction of my creative thinking into the network.

If such a network can guess what I want when I want it … does that make me obsolete? Could such a system replace me?

The answer is yes.

BUM BOM BAAAAAAAAAAAAM!!!!

This line of thinking was inspired by things like Udacity’s Self-Driving Car Nanodegree, Python Plays GTA V, and MarioFlow. These projects teach machines behavior by having a user play a video game. It is a technique that has been called Behavioral Cloning.

I wanted to take that idea and use it for creative applications, specifically 3D level design. Below I’ll talk about how I made such a neural network, and some of the ramifications.

Cloning Behavior

To clone my 3D level design behavior, I rely on two types of networks: a convolutional neural network (CNN) and a recurrent neural network (RNN).

We talked briefly about CNNs last time. Essentially, they learn features at multiple levels of hierarchy: edges, curves, leaves, branches, trees. This allows them to really understand images at a granular level.

RNNs learn sequences of things. You’ve probably seen RNNs that can write Harry Potter chapters. These networks work by being fed sequences of data, so if you feed one lots of sentences, it will start to learn which words usually follow one another.

What does the output look like?

So far I’ve trained the network on roughly five hours of me placing objects in a virtual space. This ends up being roughly 5000 images and labels. A respectable number, but more data would mean a more robust system.

Here is a video of the trained behavior clone in action:

Here is a first-person view of an environment it created:

See how the output exhibits certain visual and behavioral concepts? It’s not just random output. The trees are placed a certain distance from each other. The environment includes wide open spaces. Bushes and rocks are placed together in small groups, often with grass beside them. These are just some of the behaviors I taught it.

How To Train Your Clone

In order to teach a machine how to create 3D levels, you have to feed it something like this:

When I saw this image, I placed this asset.

I built a simple tool in Unity to train the network. You can see the process in the YouTube video below:

VR Training!

If I want to place an object, I push a button. That button saves the current frame as a PNG, and records the PNG’s timestamp along with an integer representing the asset I placed at that time. If I don’t press the button, the frame is still saved but with an integer that corresponds to not placing an asset.

When I see a bunch of trees, I usually place another tree. Or when I see a bunch of rocks, I place a rock or some grass. But I also tend to place a bunch of the same assets right after each other. This means that the neural network will encode my behavior into a patterned sequence of visuals correlated with actions.

For the more technically minded: I used a pre-trained VGG16 network and a simple Keras RNN. This lets me use a really good pre-trained network to analyze the images, and then create sequences of those analyzed images for the RNN. The VGG16 network encodes every frame of what I see into a feature vector. These vectors are grouped into sequences of 15, and the label is the action I took on the 16th image. After the prediction, the 16th image is pushed into the sequence while the first image is popped out, and the new label is the action I took on the 17th image.

Below is how to use the built-in Keras VGG16 model as a feature extractor.

import keras
from keras.models import Model

# Use the 4096-dim "fc2" layer of a pre-trained VGG16 as a feature extractor
model = keras.applications.VGG16(weights='imagenet', include_top=True)
feat_extractor = Model(inputs=model.input, outputs=model.get_layer("fc2").output)

# Images is the list of preprocessed frames; [0] drops the batch dimension
dataFeatures = [feat_extractor.predict(x)[0] for x in Images]

The features are then grouped into sequences and fed to a very simple RNN.

from keras.models import Sequential
from keras.layers import SimpleRNN, Dense

model = Sequential()
# input shape: sequences of 15 feature vectors of length 4096,
# the size of the vgg16 feature extractor's output
model.add(SimpleRNN(256, input_shape=(15, 4096)))
# 5 possible actions: the asset choices plus "place nothing"
model.add(Dense(5, activation='softmax'))
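
To make the sliding window described above concrete, here is a minimal sketch of how the 15-frame sequences and labels might be assembled and used to train that model. The actions list (the recorded integer labels) and the loss/optimizer choices are my assumptions, not taken from the original code.

import numpy as np
from keras.utils import to_categorical

SEQ_LEN = 15  # frames per input sequence

def build_sequences(features, actions, seq_len=SEQ_LEN):
    """Slide a window over the session: 15 feature vectors in,
    the action taken on the very next frame as the label."""
    X, y = [], []
    for i in range(len(features) - seq_len):
        X.append(features[i:i + seq_len])
        y.append(actions[i + seq_len])
    return np.array(X), to_categorical(y, num_classes=5)

X, y = build_sequences(dataFeatures, actions)
# Loss/optimizer aren't specified in the post; categorical crossentropy
# is a standard choice for a softmax output like this one.
model.compile(loss='categorical_crossentropy', optimizer='adam')
model.fit(X, y, epochs=10, batch_size=32)

At prediction time the same window simply slides forward one frame at a time, exactly as described above.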

And here’s a small look at what my data looks like: an ordered list of filenames and a number representing the action I took after seeing that image.

20180331132802_0
20180331132804_1
20180331132805_0
20180331132806_1
20180331132808_1
After seeing 20180331132808.png, I chose to place another tree, denoted by the integer 1.
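
The post doesn’t show how this log gets read back in, but parsing it into image paths and labels is straightforward. Here is a minimal sketch, with the log and image locations as assumed placeholder names:

# Hypothetical paths; the post doesn't specify where the log and PNGs live.
LOG_FILE = "training_session.txt"
IMAGE_DIR = "frames/"

image_paths, actions = [], []
with open(LOG_FILE) as f:
    for line in f:
        # Each line is "<timestamp>_<action>", e.g. "20180331132808_1"
        timestamp, action = line.strip().rsplit("_", 1)
        image_paths.append(IMAGE_DIR + timestamp + ".png")
        actions.append(int(action))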

The trained network is served from a Google Compute Engine instance. That way I don’t have to do the NN processing and the 3D graphics on the same computer.
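
The post doesn’t go into how the model is served on that instance. Conceptually, the Unity client sends the current 15-frame feature sequence over the network and gets the predicted action back. A minimal sketch of such an endpoint, assuming Flask (my choice of framework, not necessarily what was used):

import numpy as np
from flask import Flask, request, jsonify

app = Flask(__name__)
# `model` is the trained Keras RNN, loaded once at startup.

@app.route("/predict", methods=["POST"])
def predict():
    # Expects JSON of the form {"sequence": [15 x 4096 floats]}
    seq = np.array(request.json["sequence"])[np.newaxis, ...]
    probs = model.predict(seq)[0]
    return jsonify({"action": int(np.argmax(probs))})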

Currently, the clone moves on rails along a prerecorded path created during training. These paths will be used in later research to generate new paths via some form of RNN.

What are the next steps?

I think this is a pretty successful POC. Some immediate next steps would be:

  1. Complete the path generation RNN, or implement something like Google’s SketchRNN.
  2. Continue training. This type of thing needs way more than five hours of training to develop a robust range of behavior.
  3. Optimize the architecture for faster prediction times. For instance, once I gather enough data, I might build my own CNN, since VGG16 is overkill and accounts for the bulk of the prediction time. There are also plenty of hyperparameters that could be fine-tuned to make better predictions. I’m also not sure that sequences of feature vectors are the best input; perhaps sequences of CNN predictions would work just as well.
  4. Run multiple clones at once, at 10x the normal speed.

What does it mean now?

The subtitle says “Behavioral Cloning and 3D Procedural Content Generation” but is this really 3D procedural content generation (procgen)?

It depends on who you ask. Procgen usually involves developers hard-coding a set of rules into a system. These rules determine things like visuals, terrain, asset creation, or asset placement. While procgen is definitely fascinating and a passion of mine, maybe it’s not the most artist-friendly endeavor.

My deep learning approach creates the same kind of rules-based system, but it creates that system very differently. Whereas a standard procgen system encodes the rules into logic flows within code files, this approach encodes them into a neural network architecture and its weights. Procgen requires developers. The deep learning method requires showing the machine what you want it to do. This could make the process of designing a procgen system far more user oriented. The downside is that changing the system requires retraining, which, depending on the scope of your project, could be more work.
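
To make that contrast concrete, a hand-coded procgen rule might look something like the hypothetical snippet below, while the behavior-cloning version replaces the explicit conditions with a learned prediction:

# Hand-coded procgen: the designer writes every placement rule explicitly.
def choose_asset_rule_based(nearby_assets):
    if nearby_assets.count("tree") >= 3:
        return "tree"    # keep forests clustered
    if "rock" in nearby_assets:
        return "grass"   # dress rocks with grass
    return None          # leave open space

# Behavior cloning: the same kind of rule lives in the network's weights.
def choose_asset_learned(feature_sequence):
    return model.predict(feature_sequence).argmax()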

Still, something like this could allow artists and developers to create vast swathes of content very naturally. As gamers expect larger and more detailed worlds, does something like this allow developers to keep up with the pace?

What could it mean for the future?

We’re talking about systems that can copy what you do, just by watching your behavior. I’ve been talking about it in the context of 3D level design, but what about other fields?

For instance, a new Photoshop or Illustrator plugin that can clone your artistic behavior?

What if this plugin could continue working on a .psd or .ai file even after you’ve gone home for the night?

Or what if you have a huge deadline, so you decide to spin up 200 behavior clones to help tackle ideation? What about 10,000? What if each of those 10,000 behavior clones could work 100x faster than you?

Even if Artificial General Intelligence is far away, what happens when we move the cutting edge in deep learning out of things like self driving cars, and into things like our creativity/productivity software?

What does the future of work look like when every employee can be 10,000 super employees?

In the midst of all this thinking about automation moving into the knowledge work sector, what if we find out that it’s not some nameless AI that automates your job?

What if it’s you that automates your job?


Sam Snider-Held

I’m a creative technologist working at MediaMonks, focusing on the intersection of AR, VR, AI, UX, and Creativity.