High Resolution Face2Face with Pix2Pix 1024x1024

Input video | Detected face | Generated Pix2Pix output

Inspired by this work by Dat Tran, I prepared my own dataset and trained an improved Pix2Pix network to generate the Polish YouTuber Krzysztof Gonciarz hosting his show “Zapytaj Beczkę”.

But first let’s give it a try:

In my first approach I simply trained the original net from face2face-demo. It worked! But the output was only 256 x 256, which was not enough for me, so I decided to increase the resolution to 1024 x 1024.

High resolution mod

To increase the resolution, you need to add layers to the encoder and decoder; there is no simpler way to do it. So I added a few layers and tuned the parameters to fit the 8 GB of memory on the GTX 980M in my laptop.

If you need to modify the input/output size, just add or remove layers in the encoder and decoder:

Encoder before (ngf = 64)

# encoder_2: [batch, 128, 128, ngf] => [batch, 64, 64, ngf * 2]
# encoder_3: [batch, 64, 64, ngf * 2] => [batch, 32, 32, ngf * 4]
# encoder_4: [batch, 32, 32, ngf * 4] => [batch, 16, 16, ngf * 8]
# encoder_5: [batch, 16, 16, ngf * 8] => [batch, 8, 8, ngf * 8]
# encoder_6: [batch, 8, 8, ngf * 8] => [batch, 4, 4, ngf * 8]
# encoder_7: [batch, 4, 4, ngf * 8] => [batch, 2, 2, ngf * 8]
# encoder_8: [batch, 2, 2, ngf * 8] => [batch, 1, 1, ngf * 8]

Encoder after (ngf = 64)

# encoder_2: [batch, 512, 512, ngf] => [batch, 256, 256, ngf * 2]
# encoder_3: [batch, 256, 256, ngf * 2] => [batch, 128, 128, ngf * 4]
# encoder_4: [batch, 128, 128, ngf * 4] => [batch, 64, 64, ngf * 8]
# encoder_5: [batch, 64, 64, ngf * 8] => [batch, 32, 32, ngf * 8]
# encoder_6: [batch, 32, 32, ngf * 8] => [batch, 16, 16, ngf * 16]
# encoder_7: [batch, 16, 16, ngf * 16] => [batch, 8, 8, ngf * 16]
# encoder_8: [batch, 8, 8, ngf * 16] => [batch, 4, 4, ngf * 16]
# encoder_9: [batch, 4, 4, ngf * 16] => [batch, 2, 2, ngf * 16]
# encoder_10: [batch, 2, 2, ngf * 16] => [batch, 1, 1, ngf * 32]
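The shape progression above can be verified with a small calculation. Here is a minimal sketch, assuming (as in pix2pix-tensorflow) that each encoder layer is a stride-2 convolution that halves the spatial size; the `multipliers` list and the `encoder_shapes` helper are my own illustration, not code from the repository:

```python
ngf = 64

# Channel multiplier for the output of each encoder layer, matching
# the 1024x1024 listing above (encoder_1 .. encoder_10).
multipliers = [1, 2, 4, 8, 8, 16, 16, 16, 16, 32]

def encoder_shapes(input_size, multipliers, ngf=64):
    """Return the (height, width, channels) shape after each encoder layer."""
    shapes = []
    size = input_size
    for m in multipliers:
        size //= 2  # a stride-2 conv halves height and width
        shapes.append((size, size, ngf * m))
    return shapes

for i, shape in enumerate(encoder_shapes(1024, multipliers), start=1):
    print(f"encoder_{i}: {shape}")
```

Running this reproduces the listing, ending at a 1 x 1 bottleneck with ngf * 32 = 2048 channels; the decoder mirrors this chain in reverse with stride-2 transposed convolutions.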

Training data

I used 4 videos as input for training:

I took every 10th frame of these videos, kept only the frames with exactly one detected face, and manually removed some bad examples (e.g. frames with other people).

Results

Finally, after 3 days of training (on the GTX 980M), I got these results:

Source | Detected face | Pix2Pix output
Mix of the Pix2Pix output and the source video, falling back to the source where no face is detected
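The mixed output amounts to a simple compositing rule: if a face was detected, paste the generated face back into the source frame; otherwise pass the source frame through unchanged. A minimal sketch (the `composite` function and box format are my own illustration, assuming the generated face has already been resized to the detected box):

```python
import numpy as np

def composite(source_frame, face_box, generated_face):
    """Paste the generated face into the source frame.

    face_box is (x, y, w, h), or None when no face was detected,
    in which case the source frame is returned unchanged.
    Assumes generated_face has shape (h, w, 3).
    """
    if face_box is None:
        return source_frame
    x, y, w, h = face_box
    out = source_frame.copy()
    out[y:y + h, x:x + w] = generated_face
    return out
```

In practice you would also blend the pasted region's edges (e.g. with a feathered mask) to hide the seam.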

Source code

You can find the source code on my GitHub:

Conclusion

The original model (with Angela Merkel) worked well only at one pose and one distance from the camera. I have a similar problem here: the face was very close to the camera in all the training videos. If you are going to train a net to generate faces, remember to prepare a very good dataset. Increasing the input/output resolution was a really good idea!

From my experience, it was super easy to train this net and get the first outputs. That is totally different from YOLO/SSD networks, where you need images and annotations in a specific format. I will write about it when I successfully train YOLO or SSD!


Any ideas on what to do next? Stay tuned for more deep learning posts!

Please check out Dat Tran's stories:

If you like this work, show me your support!
Follow me on: