NeuroNuggets: Deep Anime

Neuromation · Apr 23, 2019

Last time, we had some very serious stuff to discuss. Let’s touch upon a much lighter topic today: anime! It turns out that many architectures we’ve discussed on this very blog, or plan to discuss in more detail in the future, have already been applied to Japanese-style comics and animation.

Let me start by giving a shout-out to the owner of this fantastic GitHub repository. It is the most comprehensive resource for all things anime in deep learning. Thanks for putting this together and maintaining it, whoever you are!

We will be mostly talking about generating anime characters, but the last part will be a brief overview of some other anime-related problems.

Do everything by hand, even when using the computer.

Hayao Miyazaki

Drawing Anime Characters with GANs

Guess who drew the characters you saw above? You guessed right: no manga artist thought them up; they were drawn automatically by a generative model.

The paper by Jin et al. (2017) presents an architecture based on generative adversarial models trained to generate anime characters. We have spoken about GANs several times on this blog (see, e.g., here or here), and this sounds like a relatively straightforward application. But direct applications of basic GAN architectures such as DCGAN to this problem, even a relatively successful one called (unsurprisingly) AnimeGAN, produced only low-resolution, blurry, and generally unsatisfactory images, e.g.:

Image source: https://github.com/jayleicn/animeGAN

How did Jin et al. bridge the gap between this and what we saw above?

First, let’s talk about the data. This work is a good example of a general trend: dataset collection, and especially labeling, is increasingly becoming an automated or at least semi-automated process, where models we already trust to work reliably are used to label datasets for more complex models.

To get a big collection of anime character faces, Jin et al. scraped the Getchu website, which showcases thousands of Japanese games, including unified presentations of their characters in good quality and on a neutral background:

Image source: https://arxiv.org/pdf/1708.05509.pdf

On these images, they ran a face detection model called lbpcascade, specifically trained for anime/manga faces, and then enlarged each resulting bounding box (shown in red above) by 1.5x to add some context (shown in blue above). To add the “semi-” to “semi-automated”, the authors also checked the resulting 42,000 images by hand and removed about 4% of them as false positives. They don’t show a comparison, but I’m sure this was an important step in data preparation.
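To make this concrete, here is a minimal sketch of what that cropping step could look like in OpenCV, assuming the lbpcascade_animeface.xml cascade file from the lbpcascade_animeface project; the 1.5x enlargement follows the paper’s description, while the detection parameters are just illustrative defaults:

```python
# A minimal sketch of the face-cropping step, assuming OpenCV and the
# lbpcascade_animeface.xml cascade; paths and detection parameters are
# illustrative, the 1.5x enlargement follows the description in the paper.
import cv2

def crop_anime_faces(image_path, cascade_path="lbpcascade_animeface.xml", scale=1.5):
    cascade = cv2.CascadeClassifier(cascade_path)
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    gray = cv2.equalizeHist(gray)

    # Detect candidate face boxes (the red boxes in the figure above).
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1,
                                     minNeighbors=5, minSize=(48, 48))
    crops = []
    h_img, w_img = img.shape[:2]
    for (x, y, w, h) in faces:
        # Enlarge each box 1.5x around its center to add context
        # (the blue boxes), clamping to the image borders.
        cx, cy = x + w / 2, y + h / 2
        new_w, new_h = w * scale, h * scale
        x0 = max(0, int(cx - new_w / 2))
        y0 = max(0, int(cy - new_h / 2))
        x1 = min(w_img, int(cx + new_w / 2))
        y1 = min(h_img, int(cy + new_h / 2))
        crops.append(img[y0:y1, x0:x1])
    return crops
```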

But that’s not all. Jin et al. wanted conditional generation, where you could ask for a blonde anime girl with a ponytail or a brown-eyed red-haired one with glasses. To do that, they ran a pretrained model called Illustration2Vec, which is designed to predict a large number of predefined tags for an anime/manga image. Here is a sample:

Image source: https://github.com/rezoo/illustration2vec

Jin et al. chose suitable thresholds for the classifiers in Illustration2Vec, but basically they used this pretrained model as is, relying on its accuracy to create the training set for the GAN. This is an interesting illustration of how you can bootstrap training sets from pretrained models: it won’t always work, but when it does, it can produce large training sets very efficiently. As a result, they now had a large dataset of images labeled with various tags, with a feature vector associated with every image. Here is part of this dataset in a t-SNE visualization of the feature vectors:

Image source: https://arxiv.org/pdf/1708.05509.pdf
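As a toy illustration of how such tag predictions could be turned into a condition vector for the GAN, here is a sketch where both the tag list and the threshold values are made up for the example rather than taken from the paper:

```python
# A minimal sketch of turning Illustration2Vec-style tag probabilities into a
# binary condition vector. The tag list and thresholds are illustrative,
# not the ones used by Jin et al.
import numpy as np

TAGS = ["blonde hair", "brown hair", "aqua hair", "blue eyes", "red eyes",
        "ponytail", "twintails", "glasses", "hat", "open mouth"]
THRESHOLDS = {tag: 0.25 for tag in TAGS}  # hypothetical per-tag thresholds

def tags_to_condition(tag_probs):
    """tag_probs: dict mapping tag name -> predicted probability."""
    cond = np.zeros(len(TAGS), dtype=np.float32)
    for i, tag in enumerate(TAGS):
        if tag_probs.get(tag, 0.0) >= THRESHOLDS[tag]:
            cond[i] = 1.0
    return cond

# Example: an image tagged as a blonde character with a ponytail and glasses.
example = {"blonde hair": 0.91, "ponytail": 0.62, "glasses": 0.44, "red eyes": 0.08}
print(tags_to_condition(example))
```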

The next step was to choose the GAN architecture. Jin et al. went with DRAGAN (Deep Regret Analytic Generative Adversarial Networks), a GAN variant by Kodali et al. (2017) that adds a gradient penalty term to alleviate the mode collapse problem. We will not go into further details on DRAGAN here. Suffice it to say that the final architecture is basically a standard GAN with a generator and a discriminator, plus additional loss terms for the DRAGAN gradient penalty and for the correct assignment of class labels, which makes it conditional. The architectures of both the generator and the discriminator are based on SRResNet, a fairly standard convolutional architecture with residual connections.
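For the curious, here is a minimal PyTorch sketch of the DRAGAN gradient penalty itself; the perturbation scale and penalty weight are the commonly used defaults, not necessarily the exact values from Jin et al., and the discriminator is assumed to return a single score per sample:

```python
# A minimal PyTorch sketch of the DRAGAN gradient penalty (Kodali et al., 2017):
# the discriminator's gradient norm is pushed towards 1 at points sampled
# around the real data. Hyperparameters are common defaults, not the paper's.
import torch

def dragan_gradient_penalty(discriminator, real, lambda_gp=10.0):
    # Perturb real samples with noise proportional to their standard deviation.
    alpha = torch.rand_like(real)
    perturbed = real + 0.5 * real.std() * alpha
    perturbed.requires_grad_(True)

    d_out = discriminator(perturbed)                 # assumed scalar per sample
    grads = torch.autograd.grad(outputs=d_out.sum(), inputs=perturbed,
                                create_graph=True)[0]
    grad_norm = grads.view(grads.size(0), -1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1.0) ** 2).mean()

# Added to the usual discriminator loss, e.g.:
# d_loss = adversarial_loss + class_label_loss + dragan_gradient_penalty(D, real_batch)
```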

So now we have both the data and the architecture. Then we train for a while, and then we generate!

Image source: https://arxiv.org/pdf/1708.05509.pdf

Those were the results of unconditional generation, but we can also set up some attributes as conditions. Below, on the left we have the “aqua hair, long hair, drill hair, open mouth, glasses, aqua eyes” tags and on the right we have “orange hair, ponytail, hat, glasses, red eyes, orange eyes”:

Image source: https://arxiv.org/pdf/1708.05509.pdf

And, even better, you can play with this generative model yourself. Jin et al. made the models available through a public frontend at this website; you can specify certain characteristic features and generate new anime characters automatically. Try it!

Next Step: Full-Body Generation with Pose Conditions

Hamada et al. (2018) take the next step in generating anime characters with GANs: instead of just doing a head shot like Jin et al. above, they generate a full-body image with a predefined pose. They are using the basic idea of progressively growing GANs from Karras et al. (2018), a paper that we have actually already discussed in detail on this very blog:

  • begin with training a GAN to generate extremely small images, like 4x4 pixels;
  • use the result as a condition to train a GAN that scales it up to 8x8 pixels, a process similar to superresolution;
  • use the result as a condition to train a GAN that scales it up to 16x16 pixels…
  • …and so on until you get to 1024x1024 or something like that.

The novel idea by Hamada et al. is that you can also use the pose as a condition, first expressing it in the form of a pixel mask and then scaling it down to 4x4 pixels, then 8x8, and so on:

Image source: https://arxiv.org/pdf/1809.01890v1.pdf
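A minimal sketch of this conditioning idea: downscale the full-resolution pose mask into a pyramid and concatenate the mask of the matching resolution with the feature maps at each stage. The interpolation mode and the way the mask is attached are assumptions for illustration, not the exact details of Hamada et al.:

```python
# A minimal sketch of turning a full-resolution pose mask into a pyramid of
# conditions for a progressively growing generator. Interpolation mode and
# the concatenation scheme are assumptions for illustration.
import torch
import torch.nn.functional as F

def pose_pyramid(pose_mask, max_resolution=1024):
    """pose_mask: tensor of shape (batch, channels, H, W) at full resolution."""
    pyramid = {}
    res = 4
    while res <= max_resolution:
        pyramid[res] = F.interpolate(pose_mask, size=(res, res), mode="nearest")
        res *= 2
    return pyramid

def condition_features(features, pyramid):
    """Concatenate the pose mask of matching resolution with the feature maps."""
    res = features.shape[-1]
    return torch.cat([features, pyramid[res]], dim=1)
```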

Then they created a dataset of full-body high-resolution anime characters based on the Unity 3D models for various poses:

Image source: https://arxiv.org/pdf/1809.01890v1.pdf

And as a result, the progressive structure-conditional GAN is able to generate nice pictures with predefined poses. As usual with GANs, you can interpolate between characters while keeping the pose fixed, and you can produce different poses of the same character, which makes this paper a big step towards a tool that could actually help artists and animators. Here is a sample output:

Image source: https://arxiv.org/pdf/1809.01890v1.pdf

Even Better: StyleGAN for Anime

Have you seen thispersondoesnotexist.com? It shows fake people generated by the latest and greatest GAN-based architecture for face generation, the StyleGAN, and it’s been all over the Web for a while.

Well, turns out there is an anime equivalent! thiswaifudoesnotexist.net generates random anime characters with the StyleGAN architecture and even adds a randomly generated plot summary! Like this:

Image source: https://www.thiswaifudoesnotexist.net/

Looks even better! But wait, what is this StyleGAN we speak of?

StyleGAN is an architecture by NVIDIA researchers Karras et al. (2018), the same group that had previously proposed progressively growing GANs. This time, they kept the stack of progressive superresolution but changed the architecture of the basic convolutional model, making the generator similar to style transfer networks. Essentially, instead of simply putting a latent code through a convolutional network, as traditional GANs do, StyleGAN first maps it to an intermediate code vector and then uses that vector several times to inform the synthesis network, with external noise coming in at every level. Here is a picture from the paper, with a traditional generator architecture on the left and StyleGAN on the right:

Image source: https://arxiv.org/pdf/1812.04948
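To make the picture a bit more concrete, here is a heavily simplified PyTorch sketch of the two ingredients mentioned above: a mapping network that turns the latent code z into an intermediate code w, and AdaIN-style modulation that injects w (plus per-pixel noise) at every level of the synthesis network. Layer sizes and the noise scaling are illustrative, not NVIDIA’s actual settings:

```python
# A heavily simplified sketch of the StyleGAN generator ideas, not NVIDIA's
# implementation: mapping network z -> w, AdaIN modulation, per-pixel noise.
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    def __init__(self, latent_dim=512, num_layers=8):
        super().__init__()
        layers = []
        for _ in range(num_layers):
            layers += [nn.Linear(latent_dim, latent_dim), nn.LeakyReLU(0.2)]
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        return self.net(z)  # intermediate latent code w

class AdaIN(nn.Module):
    def __init__(self, latent_dim, channels):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels)
        self.style = nn.Linear(latent_dim, channels * 2)  # per-channel scale & bias

    def forward(self, x, w):
        scale, bias = self.style(w).chunk(2, dim=1)
        x = self.norm(x)
        return x * (1 + scale[:, :, None, None]) + bias[:, :, None, None]

class SynthesisBlock(nn.Module):
    def __init__(self, latent_dim, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.adain = AdaIN(latent_dim, channels)

    def forward(self, x, w):
        x = self.conv(x)
        x = x + torch.randn_like(x) * 0.1   # external per-pixel noise
        return self.adain(x, w)             # style injected via AdaIN
```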

We won’t go into more detail here, as StyleGAN deserves a dedicated NeuroNugget to explain fully (and maybe it’ll get one). Suffice it to say that the final result looks even better. StyleGAN defines a new gold standard for face generation, as shown on thispersondoesnotexist.com and now, as we can see, on thiswaifudoesnotexist.net. As for the text generation part, that is a completely different can of worms, awaiting its own NeuroNuggets, quite possibly in the near future…

Brief Overviews

Let us close with a few more papers that solve interesting anime/manga-related problems.

Style transfer for anime sketches. We’ve spoken of GANs that use ideas similar to style transfer, but what about style transfer itself? Zhang et al. (2017) present a style transfer network based on U-Net and the auxiliary classifier GAN (AC-GAN) that can fill in sketches with color schemes derived from separate (and completely different) style images. This solves a very practical problem for anime artists: if you could draw a character in full color once and then just apply its style to new sketches, it would save a huge amount of effort. We are not quite there yet, but look at the results; in the three examples below, the sketch shown in the top left is combined with the style image shown in the bottom left to get the final image:

Image source: https://arxiv.org/pdf/1706.03319v2.pdf
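As a rough illustration of the idea (not the actual architecture of Zhang et al., which is a much deeper U-Net trained with an AC-GAN discriminator), here is a toy generator that colors a line-art sketch conditioned on a feature vector extracted from a separate style image; all layer sizes and the style encoder are made up for the example:

```python
# A deliberately simplified sketch of style-conditioned sketch colorization,
# not the architecture of Zhang et al.; layer sizes are illustrative.
import torch
import torch.nn as nn

class SketchColorizer(nn.Module):
    def __init__(self, style_dim=256):
        super().__init__()
        # Encoder for the grayscale sketch.
        self.enc1 = nn.Sequential(nn.Conv2d(1, 64, 4, 2, 1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU())
        # Style encoder: compresses the (color) style image into one vector.
        self.style_enc = nn.Sequential(
            nn.Conv2d(3, 64, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(64, style_dim, 4, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        # Decoder: mixes sketch features with the style vector (U-Net-style skip).
        self.dec1 = nn.Sequential(nn.ConvTranspose2d(128 + style_dim, 64, 4, 2, 1), nn.ReLU())
        self.dec2 = nn.Sequential(nn.ConvTranspose2d(64 + 64, 3, 4, 2, 1), nn.Tanh())

    def forward(self, sketch, style_image):
        e1 = self.enc1(sketch)
        e2 = self.enc2(e1)
        style = self.style_enc(style_image)                   # (B, style_dim, 1, 1)
        style = style.expand(-1, -1, e2.size(2), e2.size(3))  # broadcast over space
        d1 = self.dec1(torch.cat([e2, style], dim=1))
        return self.dec2(torch.cat([d1, e1], dim=1))          # colored image
```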

Interactive segmentation. Ito et al. (2016) propose an interactive segmentation method intended for manga. Manga illustrators would benefit greatly from automated or semi-automated segmentation tools that let them cut out individual characters or parts of a scene from existing drawings. That’s exactly what Ito et al. provide (without any deep learning, by the way, by improving classical segmentation techniques):

Image source: https://projet.liris.cnrs.fr/imagine/pub/proceedings/ICPR-2016/media/files/0660.pdf
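Ito et al. build on classical segmentation rather than deep learning; as a generic illustration of what “interactive segmentation” means in practice (and explicitly not the authors’ exact method), here is marker-based watershed in OpenCV, where the user scribbles a few foreground/background markers and the algorithm grows them into regions:

```python
# Generic marker-based watershed as an illustration of interactive
# segmentation; this is NOT the exact algorithm of Ito et al. (2016).
import cv2

def interactive_watershed(image_bgr, markers):
    """
    image_bgr: (H, W, 3) uint8 manga page.
    markers:   (H, W) int32 array, 0 where unlabeled and 1, 2, ... on the
               user's scribbles (e.g. 1 = background, 2 = character to cut out).
    Returns the label map after watershed (-1 on region boundaries).
    """
    return cv2.watershed(image_bgr, markers.copy())

# Example usage: extract the region labeled "2" as a binary mask.
# mask = (interactive_watershed(page, scribbles) == 2).astype("uint8") * 255
```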

Anime superresolution. We have already mentioned superresolution as a stepping stone in progressively growing GANs, but one can also use it directly to transform small and/or low-resolution images into high-quality anime. The waifu2x model is based on SRCNN (single-image superresolution with convolutional neural networks), slightly tweaked and extensively trained to handle anime. The results are actually pretty impressive; here is how waifu2x works:

Image source: https://github.com/nagadomi/waifu2x
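For reference, here is a minimal PyTorch sketch of the SRCNN recipe that waifu2x builds on: upscale the small image with plain interpolation, then let a few convolutional layers restore the sharp details. waifu2x itself uses a variant trained specifically on anime; the layer sizes below follow the original SRCNN paper, everything else is illustrative:

```python
# A minimal sketch of the SRCNN idea underlying waifu2x; waifu2x itself is a
# variant trained on anime, so treat this as the generic recipe only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SRCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=9, padding=4), nn.ReLU(),  # patch extraction
            nn.Conv2d(64, 32, kernel_size=1), nn.ReLU(),            # non-linear mapping
            nn.Conv2d(32, 3, kernel_size=5, padding=2))             # reconstruction

    def forward(self, low_res, scale=2):
        # Naive bicubic upscaling, then learned refinement of the details.
        upscaled = F.interpolate(low_res, scale_factor=scale, mode="bicubic",
                                 align_corners=False)
        return self.net(upscaled)
```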

Conclusion

Cartoons in general and anime in particular represent a very nice domain for computer vision:

  • images in anime style are much simpler than real-life photos: the edges are pronounced, the contours are mostly perfectly closed, many shapes have a distinct style that makes them easy to recognize, and so on;
  • there is natural demand from anime artists, animators, and enthusiasts for tools to manipulate anime images;
  • as we have seen even in this post, there exist large databases of categorized and tagged images that can be scraped for datasets.

So no wonder people have been able to make a lot of things work well in the domain of anime. Actually, I would expect that anime might become an important frontier for image manipulation models, a sandbox where the models work well and do cool things before they can “graduate” to realistic photographic imagery.

But to do that, the world needs large-scale open datasets and a clear formulation of the main problems in the field. Hopefully, the anime community can help with that, and then I have no doubt researchers all over the world will jump in… not only because it might be easier than real photos, but also simply because it’s so cool. Have fun!

Sergey Nikolenko
Chief Research Officer, Neuromation
