High Resolution Face2Face with Pix2Pix 1024x1024

Input video | Detected face | Generated Pix2Pix output

Inspired by this work by Dat Tran, I prepared my own dataset and trained an improved Pix2Pix network to generate the Polish YouTuber Krzysztof Gonciarz hosting his show “Zapytaj Beczkę”.

But first let’s give it a try:

In my first approach I simply trained the original net from face2face-demo. It worked! But the output was only 256 x 256, which was not enough for me, so I decided to increase the resolution to 1024 x 1024.

High resolution mod

To increase the resolution, you need to add layers to the encoder and decoder; there is no simpler way to do it. So I added a few layers and tuned the parameters to fit the 8 GB of memory on the GTX 980M in my laptop.

If you need to modify the input/output size, just add or remove layers in the encoder and decoder:

Encoder before (ngf = 64)

# encoder_2: [batch, 128, 128, ngf] => [batch, 64, 64, ngf * 2]
# encoder_3: [batch, 64, 64, ngf * 2] => [batch, 32, 32, ngf * 4]
# encoder_4: [batch, 32, 32, ngf * 4] => [batch, 16, 16, ngf * 8]
# encoder_5: [batch, 16, 16, ngf * 8] => [batch, 8, 8, ngf * 8]
# encoder_6: [batch, 8, 8, ngf * 8] => [batch, 4, 4, ngf * 8]
# encoder_7: [batch, 4, 4, ngf * 8] => [batch, 2, 2, ngf * 8]
# encoder_8: [batch, 2, 2, ngf * 8] => [batch, 1, 1, ngf * 8]

Encoder after (ngf = 64)

# encoder_2: [batch, 512, 512, ngf] => [batch, 256, 256, ngf * 2]
# encoder_3: [batch, 256, 256, ngf * 2] => [batch, 128, 128, ngf * 4]
# encoder_4: [batch, 128, 128, ngf * 4] => [batch, 64, 64, ngf * 8]
# encoder_5: [batch, 64, 64, ngf * 8] => [batch, 32, 32, ngf * 8]
# encoder_6: [batch, 32, 32, ngf * 8] => [batch, 16, 16, ngf * 16]
# encoder_7: [batch, 16, 16, ngf * 16] => [batch, 8, 8, ngf * 16]
# encoder_8: [batch, 8, 8, ngf * 16] => [batch, 4, 4, ngf * 16]
# encoder_9: [batch, 4, 4, ngf * 16] => [batch, 2, 2, ngf * 16]
# encoder_10: [batch, 2, 2, ngf * 16] => [batch, 1, 1, ngf * 32]
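The shape progression above can be verified with a small calculation. Here is a minimal sketch, assuming (as in pix2pix-tensorflow) that each encoder layer is a stride-2 convolution that halves the spatial size; the `multipliers` list and the `encoder_shapes` helper are my own illustration, not code from the repository:

```python
ngf = 64

# Channel multiplier for the output of each encoder layer, matching
# the 1024x1024 listing above (encoder_1 .. encoder_10).
multipliers = [1, 2, 4, 8, 8, 16, 16, 16, 16, 32]

def encoder_shapes(input_size, multipliers, ngf=64):
    """Return the (height, width, channels) shape after each encoder layer."""
    shapes = []
    size = input_size
    for m in multipliers:
        size //= 2  # a stride-2 conv halves height and width
        shapes.append((size, size, ngf * m))
    return shapes

for i, shape in enumerate(encoder_shapes(1024, multipliers), start=1):
    print(f"encoder_{i}: {shape}")
```

Running this reproduces the listing, ending at a 1 x 1 bottleneck with ngf * 32 = 2048 channels; the decoder mirrors this chain in reverse with stride-2 transposed convolutions.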

Training data

I used 4 videos as input for training:

I took every 10th frame of these videos, kept only the frames with exactly one detected face, and manually removed some bad examples (e.g. frames with other people).

Results

Finally, after 3 days of training (on the GTX 980M), I got these results:

Source | Detected face | Pix2Pix output
Mix of the Pix2Pix output and the source video, falling back to the source where no face is detected
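The mixed output amounts to a simple compositing rule: if a face was detected, paste the generated face back into the source frame; otherwise pass the source frame through unchanged. A minimal sketch (the `composite` function and box format are my own illustration, assuming the generated face has already been resized to the detected box):

```python
import numpy as np

def composite(source_frame, face_box, generated_face):
    """Paste the generated face into the source frame.

    face_box is (x, y, w, h), or None when no face was detected,
    in which case the source frame is returned unchanged.
    Assumes generated_face has shape (h, w, 3).
    """
    if face_box is None:
        return source_frame
    x, y, w, h = face_box
    out = source_frame.copy()
    out[y:y + h, x:x + w] = generated_face
    return out
```

In practice you would also blend the pasted region's edges (e.g. with a feathered mask) to hide the seam.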

Source code

You can find the source code on my GitHub:

Conclusion

The original model (with Angela Merkel) worked well only at one pose and one distance from the camera. I have a similar problem here: the face was very close to the camera in all the training videos. If you are going to train a net to generate faces, remember to prepare a very good dataset. Increasing the input/output resolution was a really good idea!

From my experience, it was super easy to train this net and get the first outputs. That is totally different from YOLO/SSD networks, where you need images and annotations in a specific format. I will write about it when I successfully train YOLO or SSD!


Any ideas on what to do next? Stay tuned for more deep learning posts!

Please check out Dat Tran's stories:

If you like this work, show me your support!
Follow me on: