Generate Bitmoji from a face using a GAN

Have you ever wondered how social media companies generate a bitmoji from your picture? Check out how GANs ace creating a bitmoji from a single pic.

Sumeet Badgujar
Analytics Vidhya
5 min read · Aug 12, 2021


Most of us have created our own customized bitmoji and used them across different social media apps. Earlier, bitmoji were customizable with only a limited set of attributes that could be added or changed. But that all changed in a year or two. Now people can create bitmoji that look just like themselves, all thanks to the rapid development of deep learning models. But how do they do it? Well, let’s explore how GANs do the job for us.

“Generative Adversarial Networks is the most interesting idea in the last 10 years in Machine Learning” — Yann LeCun.

Quick overview of GANs

Basic understanding of GAN

GANs are an exciting and rapidly changing field, delivering on the promise of generative models in their ability to generate realistic examples across a range of problem domains.
They are most notable in image-to-image translation tasks, such as translating photos of summer to winter or day to night, and in generating photorealistic photos of objects, scenes, and people that even humans cannot tell are fake.

The 3 basic parts of a GAN are:

  1.) Generator — creates the outputs. Normally it starts from random noise; in our case the layers use a random-normal kernel initializer instead of the usual Glorot uniform (see the sketch after this list).
  2.) Discriminator — receives the real (target) image and the generator’s fake output simultaneously, concatenated together, and checks whether each is real or fake. It acts as a critic.
  3.) Loss function — built from the generator’s output and the discriminator’s verdict.
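
Here is a minimal sketch of those three parts, assuming TensorFlow/Keras; the layer counts and sizes are illustrative, not the ones used for this project.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

# Random-normal kernel initializer instead of the default glorot_uniform
init = tf.keras.initializers.RandomNormal(mean=0.0, stddev=0.02)

def build_generator():
    inp = layers.Input(shape=(256, 256, 3))
    x = layers.Conv2D(64, 4, strides=2, padding="same", kernel_initializer=init)(inp)
    x = layers.LeakyReLU(0.2)(x)
    x = layers.Conv2DTranspose(3, 4, strides=2, padding="same",
                               activation="tanh", kernel_initializer=init)(x)
    return Model(inp, x, name="generator")

def build_discriminator():
    inp = layers.Input(shape=(256, 256, 3))
    x = layers.Conv2D(64, 4, strides=2, padding="same", kernel_initializer=init)(inp)
    x = layers.LeakyReLU(0.2)(x)
    # A map of real/fake scores rather than a single scalar
    x = layers.Conv2D(1, 4, padding="same", kernel_initializer=init)(x)
    return Model(inp, x, name="discriminator")

# Loss: binary cross-entropy over the discriminator's real/fake verdicts
bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
```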

But a normal GAN of this kind requires paired images, i.e. an input and its desired output, so that it can learn the features.

What do I mean by paired and unpaired?

But we want to generate a Bitmoji that looks like the person from just a single pic. For this we don’t have a paired dataset. Plus, we want the GAN to learn on its own, i.e. unsupervised learning. So is there such a GAN?

Yes, there is. It’s called CycleGAN (there are others too, but for this we will stick with CycleGAN).

CycleGAN

Basic architecture of CycleGAN

Imagine CycleGAN as two GANs working together. It consists of:

  • Two mappings G : X -> Y and F : Y -> X.
  • Corresponding adversarial discriminators Dx and Dy.

Role of G: G tries to translate samples from X into outputs, which are fed through Dy to check whether they are real or fake according to Domain Y.

Role of F: F tries to translate samples from Y into outputs, which are fed through Dx to check whether they are indistinguishable from Domain X.
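
Put together, the two mappings and their critics can be wired up like this, reusing the hypothetical builders from the earlier sketch:

```python
G = build_generator()        # G : X -> Y  (photo  -> bitmoji)
F = build_generator()        # F : Y -> X  (bitmoji -> photo)
D_x = build_discriminator()  # judges whether an image looks like Domain X
D_y = build_discriminator()  # judges whether an image looks like Domain Y

# One forward pass of both translations
def translate(x, y):
    fake_y = G(x)   # X translated into Domain Y, critiqued by Dy
    fake_x = F(y)   # Y translated into Domain X, critiqued by Dx
    return D_y(fake_y), D_x(fake_x)
```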

Loss function

Cyclic consistency loss

Cyclic consistency loss is where the real power of CycleGAN lies. Just translating an image from one domain to another using a GAN is not enough. The loop is complete only if the model can translate an image to the other domain and then revert it back to its original state; only then can one say that the model has learned the true features for the task at hand. This is called cyclic loss.

This kind of loss uses the intuition that if we translate a sample from Domain X to Y using the mapping function G and then map it back to X using the function F, we can measure how close we come to the original sample. Similarly, it calculates the loss incurred by translating a sample from Y to X and then back again to Y. This cyclic loss should be minimised.
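
For reference, the CycleGAN paper formalises this round trip as an L1 reconstruction penalty in both directions:

```latex
\mathcal{L}_{cyc}(G, F) =
    \mathbb{E}_{x \sim p_{data}(x)} \big[ \lVert F(G(x)) - x \rVert_1 \big]
  + \mathbb{E}_{y \sim p_{data}(y)} \big[ \lVert G(F(y)) - y \rVert_1 \big]
```

A minimal sketch of it in code, reusing G and F from above (the weighting factor of 10 is the paper’s default, not a value stated in this article):

```python
def cycle_loss(x, y, lam=10.0):
    cycled_x = F(G(x))  # X -> Y -> X round trip
    cycled_y = G(F(y))  # Y -> X -> Y round trip
    # Mean absolute (L1) error between each sample and its reconstruction
    return lam * (tf.reduce_mean(tf.abs(cycled_x - x)) +
                  tf.reduce_mean(tf.abs(cycled_y - y)))
```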

CycleGAN architecture

The generator of the GAN is basically a UNET with convolutional bottleneck blocks applied to the latent space. For the latent-space feature extraction, ResNet blocks were used.

The number of blocks has a direct effect on the training time and the results. For this case, experiments were run with different depths of residual blocks. The best results were obtained at a depth of 5 ResNet blocks; increasing the depth further brought no improvement in the results, only a longer training time.
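
A sketch of one such residual block and the depth-5 bottleneck, assuming plain Conv2D layers (the article does not spell out the exact layers, and CycleGAN implementations usually also add instance normalization, omitted here for brevity):

```python
def resnet_block(x, filters=256):
    y = layers.Conv2D(filters, 3, padding="same", kernel_initializer=init)(x)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same", kernel_initializer=init)(y)
    return layers.add([x, y])  # skip connection: output = input + residual

def bottleneck(x, depth=5):  # depth 5 gave the best results here
    for _ in range(depth):
        x = resnet_block(x)
    return x
```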

Results

The model was trained for 30 epochs on a total of 8k images. The big drawback of GANs is that the training time is extremely long, hence only 30 epochs.
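
To give a sense of where that time goes, here is a hypothetical single training step: every step runs both generators, both discriminators, and the cycle round trips. The optimizer settings (Adam, learning rate 2e-4, beta_1 0.5) are common CycleGAN defaults, not values stated in the article.

```python
opt_gen = tf.keras.optimizers.Adam(2e-4, beta_1=0.5)
opt_disc = tf.keras.optimizers.Adam(2e-4, beta_1=0.5)

@tf.function
def train_step(x, y):
    with tf.GradientTape(persistent=True) as tape:
        fake_y = G(x, training=True)   # photo  -> bitmoji
        fake_x = F(y, training=True)   # bitmoji -> photo
        # Generators try to make the critics answer "real" (ones)...
        gen_loss = (bce(tf.ones_like(D_y(fake_y)), D_y(fake_y))
                    + bce(tf.ones_like(D_x(fake_x)), D_x(fake_x))
                    + cycle_loss(x, y))
        # ...while each critic learns to score real as 1 and fake as 0.
        disc_loss = (bce(tf.ones_like(D_y(y)), D_y(y))
                     + bce(tf.zeros_like(D_y(fake_y)), D_y(fake_y))
                     + bce(tf.ones_like(D_x(x)), D_x(x))
                     + bce(tf.zeros_like(D_x(fake_x)), D_x(fake_x)))
    gen_vars = G.trainable_variables + F.trainable_variables
    disc_vars = D_x.trainable_variables + D_y.trainable_variables
    opt_gen.apply_gradients(zip(tape.gradient(gen_loss, gen_vars), gen_vars))
    opt_disc.apply_gradients(zip(tape.gradient(disc_loss, disc_vars), disc_vars))
    del tape
```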

Good results

The results were good enough considering the limited data and the short training time. But not all the results were good.

The image translation sometimes unnecessarily converted men into women. Well, it could be called a feature if we wanted! But not in this case.

From left to right — a.) Man becomes woman. b.) Just an outlier failure.

Another problem was accessories. The bitmoji dataset had no faces with accessories. So the model treated anything on the head as hair, be it caps, bandanas, or headphones. In some cases shadows became a problem too, making the output bitmoji darker.

From left to right — a.) Assumes cap as hair. b.) Doesn’t work well with accessories.

Future work

Use a face-point detection model to extract the face points or mesh and cascade them into the generator, for more robust and accurate results.
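
One hypothetical way to do that cascading, sketched here as an assumption rather than a finished design: rasterize the detected face points into an extra channel and let the generator consume it alongside the RGB image.

```python
def build_landmark_conditioned_generator():
    rgb = layers.Input(shape=(256, 256, 3))   # the face photo
    mesh = layers.Input(shape=(256, 256, 1))  # rasterized face points / mesh heatmap
    x = layers.Concatenate()([rgb, mesh])     # condition the generator on face geometry
    x = layers.Conv2D(64, 4, strides=2, padding="same", kernel_initializer=init)(x)
    x = layers.LeakyReLU(0.2)(x)
    x = layers.Conv2DTranspose(3, 4, strides=2, padding="same",
                               activation="tanh", kernel_initializer=init)(x)
    return Model([rgb, mesh], x, name="landmark_generator")
```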
