Video Generation With pix2pix

JC Testud
7 min read · Nov 26, 2018


How Can We Generate Videos For Fun?

Generative machine learning is cool, that's a fact. We can produce fun text and do all kinds of things with images, which now includes generating as many high-resolution cheeseburgers as we want. Video data, however, is another beast.

To generate videos, the intuitive technical answer would also be a GAN. Let's say we want to generate more videos of cats: we could take all the cat videos on YouTube as training data and make the generator and discriminator battle for ages until the former produces realistic cat videos.

"We're not here yet"

Unfortunately, we are not there yet. At some point, we will probably see a YoutubeGAN from Google, but it will probably take a gazillion TPU-hours of compute to train. On the flip side, we will be able to generate an infinite number of cat videos. Yeah!

In the meantime, what can we mere mortals do with a single gaming GPU?

Next-Frame Prediction

One simpler (but limited) option is to learn to predict the next frame in a video.

To generate videos:

  1. Find a cool video (it should be simple and predictable)
  2. Train a model to predict the next frame from past frame(s)
  3. Choose a seed image and start a feedback loop
    i.e. predict the next frame, use that as your new input, etc. (sketched below)
  4. Build a video with the generated frames
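
For steps 3 and 4, here is a minimal sketch of that feedback loop. The model is just assumed to be any callable that maps one frame (a NumPy array) to the predicted next one; nothing here is tied to the actual repo.

import cv2

def generate_video(model, seed_frame, n_frames, out_path, fps=24):
    # "model" is assumed to be a callable mapping one frame to the next
    h, w = seed_frame.shape[:2]
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    frame = seed_frame
    for _ in range(n_frames):
        frame = model(frame)   # the prediction becomes the next input
        writer.write(frame)
    writer.release()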

The tricky part is obviously the training step: what kind of model should we use?

Pix2Pix

You probably already know pix2pix: it is the all-purpose, Swiss-Army-knife image-to-image translation architecture behind edges2cat, black & white to color, day to night, ...

Did you know there is now an edges2pikachu? You really can’t stop progress.

Speaking of progress, here is a very personal GAN-paper timeline, if you are wondering how we got here:

Using pix2pix For Next Frame Prediction

Pix2pix (with different twists) has been used by several people for video generation. The most prominent user (and maybe the inventor of this kind of usage?) is Mario Klingemann. One work in particular made the news: his generative fireworks. This guy is amazing; I highly encourage you to follow him on Twitter (Mario Klingemann).

Video Generation Walk-through with pix2pixHD

To train our model, we are going to use the PyTorch implementation of Nvidia’s pix2pixHD architecture. Everything we will need is available in my fork of the project:

git clone -b video https://github.com/jctestud/pix2pixHD.git

pix2pixHD is designed to ingest a lot of information about the input image (for example, a label map that identifies where the pedestrians, the cars, etc. are).

We don't need that. In our case, it will be pretty straightforward:

  • as input, we will use an RGB image
  • as output, we will also use an RGB image (the next frame)

Note: the generator has extremely little context (just one past frame to predict the next one), but it works reasonably well on videos with simple motion.
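
To make the pairing concrete, here is a rough sketch of how consecutive frames turn into training pairs. In the actual repo the pairs are built on the fly during training; the directory path below is just an assumption for illustration.

import os

def make_frame_pairs(frames_dir):
    # pair each frame with the one that follows it: (frame t, frame t+1)
    frames = sorted(os.listdir(frames_dir))
    return [(os.path.join(frames_dir, a), os.path.join(frames_dir, b))
            for a, b in zip(frames, frames[1:])]

pairs = make_frame_pairs("./datasets/fire_dataset/frames")  # illustrative path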

In this walk-through, I am going to use the following fire video. It is just 1 minute long but it will do the job: https://vimeo.com/62740159

Let's start by extracting the frames from the video:

python3 extract_frames.py -video ~/Downloads/fire.mp4 -name fire_dataset -p2pdir . -width 1280 -height 736

Note: The network architecture requires image sizes that are divisible by 32. The closest we can get to 720p is therefore 1280x736. By default, my script will try to extract frames at that size.
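
If you want to adapt this to another resolution, rounding each dimension up to the next multiple of 32 gives a compatible size (a small convenience snippet, not part of the repo):

def round_up_to_32(x):
    # smallest multiple of 32 that is greater than or equal to x
    return ((x + 31) // 32) * 32

print(round_up_to_32(1280), round_up_to_32(720))  # 1280 736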

Let's now train a model for one epoch (one pass over the entire video):

python3 train_video.py --name fire_project --dataroot ./datasets/fire_dataset/ --save_epoch_freq 1 --ngf 32

Note: My code generates the image pairs on the fly. To have a look at what is actually generated, add the “debug” flag.

Also, the “ngf” option I am using is one of the many tricks you can try to reduce the GPU memory consumption of the model (to hopefully make it fit on a 12 GB or even 8 GB GPU).

You can now generate a video with the weight checkpoint that was just saved:

python3 generate_video.py --name fire_project --dataroot ./datasets/fire_dataset/ --fps 24 --ngf 32 --which_epoch 1 --how_many 200

Note: in this example, we generate 200 frames of size 1280x736. The script then takes the first 60 frames of the original video (the 60th frame serves as the seed for the feedback loop) and adds the 200 generated ones to build a 1280x720, 24 fps video.
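
As an illustration of what the note describes (this is a reconstruction, not the actual script), the output clip is roughly assembled like this:

def assemble_clip(real_frames, model, n_real=60, n_generated=200):
    # start with the real frames, then let the feedback loop take over
    clip = list(real_frames[:n_real])
    frame = real_frames[n_real - 1]      # the 60th frame seeds the loop
    for _ in range(n_generated):
        frame = model(frame)             # next-frame prediction
        clip.append(frame)
    return clip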

In my case, I get this:

Now, leave your GPU alone for a couple of hours. Then come back and appreciate how fast pix2pixHD has learned to generate flames:

Note: to generate that kind of video, the command is:

python3 generate_progress_video.py --name fire_project --dataroot ./datasets/fire_dataset/ --fps 24 --ngf 32 --pstart 1 --pstop 47

The "Scheduled Sampling" Trick — For More Stability (in theory)

While I was looking around for similar projects (spoiler: the code is usually not available), someone pointed me to a recent video generation project that is part of Google's Magenta. The author, Damien Henry, uses the original pix2pix architecture (a TensorFlow port, to be precise) to generate videos.

One cool addition to the project is the use of a special training trick to mitigate the divergence problem in the generated videos.

This trick is inspired by this paper: Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks, Samy Bengio et al., 2015

In the paper, the idea is tested on character sequences and RNNs, but it applies directly to frames. In a nutshell, the idea is to randomly (with a smart schedule) use a fake input (one generated by the generator/model) when building the training pairs. By making the model learn to go from a divergent prediction at step t to the ground truth at step t+1, we may force it to auto-correct, i.e. learn how to get back on track, which could ultimately stabilize the generative process.

Note: the divergent input could come from one or several recursions through the generator.

Here is an example with frames:

Traditionally, the training process uses pairs of real images as input/output, like A-C. In a "scheduled sampling" training, you also use generated fake images as input and build pairs like B-C.

In my (simple) implementation, everything is done on the fly in the training loop. While iterating over the video, the generated fake image is saved each time we perform a forward pass. The next step then has a 1/2 (configurable) chance of using the latest generated frame as input. With that “dice roll” configuration, you have a 1/4 chance of using a second recursion as input, a 1/8 chance of using a third recursion, and so on.
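
Here is a stripped-down sketch of that dice roll, assuming a training loop over consecutive frame pairs. Names like generator and train_step are placeholders, not the actual pix2pixHD code.

import random

def train_epoch_scheduled_sampling(generator, train_step, frame_pairs, p_fake=0.5):
    # with probability p_fake, replace the real input frame with the model's
    # own last prediction, so it learns to recover from its own drift
    last_fake = None
    for real_input, target in frame_pairs:
        if last_fake is not None and random.random() < p_fake:
            model_input = last_fake     # a previously generated frame (1+ recursions)
        else:
            model_input = real_input    # the real previous frame
        fake = generator(model_input)   # forward pass
        train_step(fake, target)        # optimize against the ground truth at t+1
        last_fake = fake                # keep it for the next dice roll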

In the magenta implementation, it is a bit different. For more information, I encourage you to read Damien’s blog post.

Scheduled Sampling In Practice

In my case, the results are not necessarily as interesting as the idea. Sometimes it does not work at all; sometimes it does (?), but then it produces some (intended?) rollback motions that remove some of the craziness we like in generative ML. I wish I had a better understanding of these failures, but testing and tuning custom architectures is very time-consuming…

Here is one interesting example, however, to end on a more positive note.

Let's use an input video with some stable spatial structure we don't want to lose. In this case, the shore of a river.

static rocks and moving water — source (again)

The training command (with scheduled sampling on) is the following:

python3 train_video.py --name water_project --dataroot ./datasets/water_dataset/ --save_epoch_freq 1 --ngf 32 --scheduled_sampling --serial_batches

Here are the results:

We clearly see the part played by scheduled sampling: in the SS-ON model, the rocks never disappear, and the water still (kind of) flows through them. This is fun :)

The SS-free model is fun in its own way too, when it imagines grass-like moving shorelines.

Future Work :)

If I find the time, I would like to:

  • make everything more stable (each checkpoint is so different that it drives me crazy…)
  • start investigating overfitting,
  • try out other architectures, especially some built specifically for video generation. I am thinking about the recent vid2vid project (by Nvidia, obviously)

Anyway, the code is available if you want to experiment or contribute. Have fun!

(BONUS) Generating Fireworks!

Here is one of my attempts at generating fireworks, from a model that was fine-tuned with some "scheduled sampling" (you can actually notice it: some flares weirdly slow down and sometimes go backwards).

I also manually zoomed in during the feedback loop.
This is what creates this Millennium Falcon effect.
As an interesting side effect, it also wakes up the model a little :)
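
A minimal version of that zoom-in trick, applied to each frame before it is fed back into the model (the zoom factor is arbitrary):

import cv2

def zoom_in(frame, factor=1.01):
    # crop the center of the frame and scale it back up to the original size
    h, w = frame.shape[:2]
    crop_h, crop_w = int(h / factor), int(w / factor)
    y0, x0 = (h - crop_h) // 2, (w - crop_w) // 2
    crop = frame[y0:y0 + crop_h, x0:x0 + crop_w]
    return cv2.resize(crop, (w, h), interpolation=cv2.INTER_LINEAR)

# inside the feedback loop: frame = model(zoom_in(frame))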

That’s it for this story. If you liked it, you can follow me on Medium or Twitter to get the latest news!
