Reproducing Japanese Anime Styles With CartoonGAN AI

Jul 14, 2018


From Hayao Miyazaki’s Spirited Away to Satoshi Kon’s Paprika, Japanese anime has made it okay for adults everywhere to enjoy cartoons again. Now, a team of researchers from Tsinghua University and Cardiff University has introduced CartoonGAN, an AI-powered technique that reproduces the styles of Japanese anime maestri from snapshots of real-world scenery.

Anime has distinct aesthetics, and traditional manual techniques for transforming real-world scenes require considerable expertise and expense, as artists must painstakingly draw lines and shade colors by hand to create high-quality scene reproductions.

A real-world train station scene (left) transformed to a cartoon-style picture (right).

Meanwhile, existing transformation methods based on non-photorealistic rendering (NPR) or convolutional neural networks (CNN) are also either time-consuming or impractical as they require paired images for model training. Moreover, these methods do not produce satisfactory cartoonization results, as (1) different cartoon styles have unique characteristics involving high-level simplification and abstraction, and (2) cartoon images tend to have clear edges, smooth color shading and relatively simple textures, which present challenges for the texture-descriptor-based loss functions used in existing methods.

CartoonGAN is a GAN framework composed of two CNNs which enables style translation between two unpaired datasets: a Generator for mapping input images to the cartoon manifold; and a Discriminator for judging whether the image is from the target manifold or synthetic. Residual blocks are introduced to simplify the training process.
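The skip connection behind a residual block can be sketched in a few lines of Python. This is only an illustration of the general idea, not the authors' generator: the `residual_block` helper and the toy transforms are hypothetical names, and plain lists stand in for image feature maps.

```python
from typing import Callable, List

Vector = List[float]


def residual_block(x: Vector, transform: Callable[[Vector], Vector]) -> Vector:
    """Return x + transform(x): the input is added back onto the learned change."""
    fx = transform(x)
    if len(fx) != len(x):
        raise ValueError("transform must preserve the input length")
    return [xi + fi for xi, fi in zip(x, fx)]


# Toy stand-ins for the learned layers inside a block (assumptions, not real layers).
def zero_transform(v: Vector) -> Vector:
    return [0.0 for _ in v]


def half_transform(v: Vector) -> Vector:
    return [0.5 * vi for vi in v]


features = [1.0, 2.0]
unchanged = residual_block(features, zero_transform)  # behaves like the identity
adjusted = residual_block(features, half_transform)   # input plus a correction
```

Because the block only has to learn a correction on top of its input, a transform that outputs zeros leaves the signal unchanged, which is one common explanation of why residual blocks make deep networks easier to train.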

To avoid slow convergence and obtain high-quality stylization, the architecture integrates a dedicated semantic content loss, an edge-promoting adversarial loss, and an initialization phase. The content loss is defined as the ℓ1 sparse regularization (instead of the ℓ2 norm) of the difference between the VGG (Visual Geometry Group) feature maps of the input photo and those of the generated cartoon image.
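To see why the choice of norm matters, the following minimal sketch compares an ℓ1 and an ℓ2 penalty on toy feature vectors. It assumes the features have already been extracted and flattened; the function names and numbers are made up for illustration, and averaging over elements (rather than summing) is a simplification.

```python
from typing import Sequence


def _check_same_length(a: Sequence[float], b: Sequence[float]) -> None:
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("feature maps must be non-empty and the same length")


def l1_content_loss(photo_feats: Sequence[float], cartoon_feats: Sequence[float]) -> float:
    """Mean absolute difference between two flattened feature maps."""
    _check_same_length(photo_feats, cartoon_feats)
    return sum(abs(p - c) for p, c in zip(photo_feats, cartoon_feats)) / len(photo_feats)


def l2_content_loss(photo_feats: Sequence[float], cartoon_feats: Sequence[float]) -> float:
    """Mean squared difference between two flattened feature maps."""
    _check_same_length(photo_feats, cartoon_feats)
    return sum((p - c) ** 2 for p, c in zip(photo_feats, cartoon_feats)) / len(photo_feats)


# Hypothetical features: the same total change, either spread out or concentrated.
photo = [0.0, 0.0, 0.0, 0.0]
spread_change = [0.5, 0.5, 0.5, 0.5]
local_change = [0.0, 0.0, 0.0, 2.0]

l1_spread = l1_content_loss(photo, spread_change)
l1_local = l1_content_loss(photo, local_change)
l2_spread = l2_content_loss(photo, spread_change)
l2_local = l2_content_loss(photo, local_change)
```

Both changes have the same ℓ1 cost (0.5), but the ℓ2 penalty is four times larger for the concentrated change (1.0 versus 0.25). An ℓ2 content loss therefore pushes harder against the strong local changes that cartoon stylization tends to introduce.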

An example of a Makoto Shinkai stylization shows the importance of each component in CartoonGAN: the initialization phase helps the model converge quickly to the target manifold; sparse regularization copes with style differences between cartoon images and real-world photos while retaining the original content; and the adversarial loss function creates the clear edges.

Changing components in the CartoonGAN loss function: (a) input photo, (b) without initialization process, (c) using ℓ2 regularization for content loss, (d) removing edge loss, (e) CartoonGAN result.
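The edge-promoting adversarial loss and the way it combines with the content loss can be sketched as follows. This assumes the three-term form given in the CartoonGAN paper, in which the discriminator is also shown cartoon images whose edges have been blurred and is asked to reject them, and a content weight of 10. Neither detail is spelled out in this article, and the function names and scores below are hypothetical.

```python
import math
from typing import Sequence


def _mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def edge_promoting_adversarial_loss(
    d_cartoon: Sequence[float],
    d_edge_smoothed: Sequence[float],
    d_generated: Sequence[float],
) -> float:
    """Adversarial objective from discriminator scores, each strictly in (0, 1).

    d_cartoon:       scores on real cartoon images (should be judged real)
    d_edge_smoothed: scores on cartoons with blurred edges (should be judged fake)
    d_generated:     scores on generator outputs (should be judged fake)
    """
    real_term = _mean([math.log(s) for s in d_cartoon])
    edge_term = _mean([math.log(1.0 - s) for s in d_edge_smoothed])
    fake_term = _mean([math.log(1.0 - s) for s in d_generated])
    return real_term + edge_term + fake_term


def total_objective(adversarial: float, content: float, content_weight: float = 10.0) -> float:
    """Adversarial loss plus weighted content loss (the weight is an assumption)."""
    return adversarial + content_weight * content


# A discriminator that cannot tell the three groups apart ...
undecided = edge_promoting_adversarial_loss([0.5], [0.5], [0.5])
# ... versus one that separates real cartoons from blurred and generated images.
confident = edge_promoting_adversarial_loss([0.9], [0.1], [0.1])
```

The discriminator tries to maximize this value, so `confident` comes out higher than `undecided`, while the generator tries to minimize it. The extra edge term is what lets the discriminator penalize outputs with soft edges even when their shading looks cartoon-like.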

Both real-world photos and cartoon images are used for model training, while the test data contains only real-world pictures. All training images are resized to 256×256 pixels. Researchers downloaded 6,153 real-world pictures from Flickr, 5,402 of which were for training and the rest for testing. A total of 14,704 cartoon images from popular anime artists Makoto Shinkai, Mamoru Hosoda, Hayao Miyazaki, and Satoshi Kon were used for model training.
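The photo split described above can be reproduced with a short sketch. The counts come from the article; the random shuffle, the seed, and the placeholder file names are assumptions, since the article does not say how the test photos were chosen.

```python
import random
from typing import List, Sequence, Tuple

TOTAL_PHOTOS = 6153   # real-world photos downloaded from Flickr
TRAIN_PHOTOS = 5402   # photos used for training
TEST_PHOTOS = TOTAL_PHOTOS - TRAIN_PHOTOS  # the rest are held out for testing


def split_photos(
    paths: Sequence[str], n_train: int, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Shuffle the paths reproducibly and split them into train and test lists."""
    if not 0 <= n_train <= len(paths):
        raise ValueError("n_train must be between 0 and the number of paths")
    shuffled = list(paths)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder file names standing in for the downloaded photos.
photo_paths = [f"photo_{i:04d}.jpg" for i in range(TOTAL_PHOTOS)]
train_paths, test_paths = split_photos(photo_paths, TRAIN_PHOTOS)
```

This leaves 751 photos for testing, so roughly 88 percent of the photos are used for training.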

Compared with recently proposed CNN-based image transformation frameworks such as CycleGAN and Gatys et al.’s Neural Style Transfer (NST) method, CartoonGAN more successfully reproduces clear edges and smooth shading while accurately retaining the input photo’s original content.

Because NST only uses a single stylization reference image for model training, it cannot deeply learn a particular anime style, especially when there are significant content differences between the stylization reference image and the input images. Improvements can be seen when more training data is introduced. However, even if a large collection of training data is used, stylization inconsistencies may appear between regions within the image.

Although the upgraded CycleGAN + L_identity model, which adds an identity loss, is better at preserving the input photo’s content, it is still unable to reproduce Makoto Shinkai’s or Hayao Miyazaki’s artistic styles as accurately as CartoonGAN does. Moreover, with a processing time of 1617.69 s, CartoonGAN is 33 percent faster than CycleGAN and 50 percent faster than CycleGAN + L_identity.

Comparison of CartoonGAN with other image transformation frameworks for Makoto Shinkai (top) and Hayao Miyazaki (bottom) styles.

The paper’s authors say they will focus on improving cartoon portrait stylization for human faces in their future research, while exploring applications for other image synthesis tasks with designed loss functions. The team also plans to extend the CartoonGAN method to video stylization by adding sequential constraints to the training process.

The paper CartoonGAN: Generative Adversarial Networks for Photo Cartoonization was accepted by last month’s CVPR 2018 (Conference on Computer Vision and Pattern Recognition) in Salt Lake City, USA.

Source: Synced China

Localization: Tingting Cao | Editor: Meghan Han, Michael Sarazen
