High Resolution Face2Face with Pix2Pix 1024x1024
Inspired by this work by Dat Tran, I prepared my own dataset and trained an improved Pix2Pix network to generate the Polish youtuber Krzysztof Gonciarz in his show “Zapytaj Beczkę”.
But first let’s give it a try:
In my first approach I just trained the original net from face2face-demo. It worked! But the 256 x 256 resolution was not enough for me, so I decided to increase it to 1024 x 1024.
High resolution mod
To increase the resolution one needs to add layers to the encoder and decoder; there is no simpler way to do it. So I added a few layers and tuned the parameters to fit into the 8 GB of memory on the GTX 980M in my laptop.
If you need to modify the input/output size, just add or remove layers in the encoder and decoder:
Encoder before (ngf = 64)
# encoder_2: [batch, 128, 128, ngf] => [batch, 64, 64, ngf * 2]
# encoder_3: [batch, 64, 64, ngf * 2] => [batch, 32, 32, ngf * 4]
# encoder_4: [batch, 32, 32, ngf * 4] => [batch, 16, 16, ngf * 8]
# encoder_5: [batch, 16, 16, ngf * 8] => [batch, 8, 8, ngf * 8]
# encoder_6: [batch, 8, 8, ngf * 8] => [batch, 4, 4, ngf * 8]
# encoder_7: [batch, 4, 4, ngf * 8] => [batch, 2, 2, ngf * 8]
# encoder_8: [batch, 2, 2, ngf * 8] => [batch, 1, 1, ngf * 8]
Encoder after (ngf = 64)
# encoder_2: [batch, 512, 512, ngf] => [batch, 256, 256, ngf * 2]
# encoder_3: [batch, 256, 256, ngf * 2] => [batch, 128, 128, ngf * 4]
# encoder_4: [batch, 128, 128, ngf * 4] => [batch, 64, 64, ngf * 8]
# encoder_5: [batch, 64, 64, ngf * 8] => [batch, 32, 32, ngf * 8]
# encoder_6: [batch, 32, 32, ngf * 8] => [batch, 16, 16, ngf * 16]
# encoder_7: [batch, 16, 16, ngf * 16] => [batch, 8, 8, ngf * 16]
# encoder_8: [batch, 8, 8, ngf * 16] => [batch, 4, 4, ngf * 16]
# encoder_9: [batch, 4, 4, ngf * 16] => [batch, 2, 2, ngf * 16]
# encoder_10: [batch, 2, 2, ngf * 16] => [batch, 1, 1, ngf * 32]
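Each encoder layer is a stride-2 convolution, so the spatial resolution halves at every step until it reaches a 1 x 1 bottleneck; going from 256 to 1024 input therefore needs two extra layers. A minimal sketch that reproduces the shape progressions above (the helper name `encoder_shapes` and the multiplier lists are mine, not from the original code):

```python
def encoder_shapes(input_size, ngf, multipliers):
    """Compute (height, width, channels) after each encoder layer.

    Each layer is a stride-2 convolution, so spatial resolution halves
    at every step; `multipliers` lists the channel multiplier (relative
    to ngf) for encoder_2 onwards.
    """
    size = input_size // 2           # encoder_1: stride-2 conv, ngf channels
    shapes = [(size, size, ngf)]
    for m in multipliers:
        size //= 2
        shapes.append((size, size, ngf * m))
    return shapes

# Original 256 x 256 net: encoder_2 .. encoder_8
original = encoder_shapes(256, 64, [2, 4, 8, 8, 8, 8, 8])
# Modified 1024 x 1024 net: encoder_2 .. encoder_10
modified = encoder_shapes(1024, 64, [2, 4, 8, 8, 16, 16, 16, 16, 32])
print(original[-1])  # -> (1, 1, 512)
print(modified[-1])  # -> (1, 1, 2048)
```

Both variants end in a 1 x 1 bottleneck; the decoder mirrors the same list in reverse.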
Training data
I used 4 videos as input for training:
- Zapytaj Beczkę #145 https://youtu.be/_aEnuh9tnKg
- Zapytaj Beczkę #143 https://youtu.be/yI8fJtcNkRA
- Zapytaj Beczkę #140 https://youtu.be/gii3w0t1KIs
- Zapytaj Beczkę #139 https://youtu.be/X3asNGDdUpw
I took every 10th frame of these videos, kept only frames with exactly one detected face, and manually removed some bad examples (e.g. frames with other people).
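The frame-selection step can be sketched as follows. This is only the filtering logic; the face counts themselves would come from any detector (e.g. dlib or an OpenCV Haar cascade), and all names here are illustrative, not from the original code:

```python
def select_training_frames(face_counts, step=10):
    """Pick every `step`-th frame that contains exactly one face.

    `face_counts` maps frame index -> number of faces reported by a
    face detector. Frames with zero or multiple faces are skipped;
    remaining bad examples still need to be removed by hand.
    """
    return [i for i in sorted(face_counts)
            if i % step == 0 and face_counts[i] == 1]

# Example: frames 0..49, where frame 20 shows two people and frame 30 none
counts = {i: 1 for i in range(50)}
counts[20] = 2
counts[30] = 0
print(select_training_frames(counts))  # -> [0, 10, 40]
```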
Results
Finally, after 3 days of training on the GTX 980M, I got these results:
Source code
You can find source code on my GitHub:
Conclusion
The original model (with Angela Merkel) worked well only for one head position and distance from the camera. I had a similar problem here: the face was very close to the camera in all training videos. If you are going to train a net to generate faces, remember to prepare a very good dataset. Increasing the input/output resolution was a really good idea!
From my experience it was super easy to train this net and get the first outputs. Totally different from YOLO/SSD networks, where you need images and annotations in a specific format. I will write more when I successfully train YOLO or SSD!
Any ideas what to do next? Stay tuned for more deep learning posts!
Please check out Dat Tran’s stories:
If you like this work, show me your support!
Follow me on:
- YouTube: https://www.youtube.com/karolmajek
- Twitter: https://twitter.com/karol_majek
- Medium: https://medium.com/@karol_majek