Generating extinct Japanese script with Adversarial Autoencoders: Theory and Implementation

Adrian Yijie Xu, PhD
Published in GradientCrescent · Feb 19, 2019

Introduction

Be it political deepfakes, near real-time video modification, or the creation of hybrid celebrity faces, the generative capabilities of neural networks have rapidly shot to the spotlight as we move beyond the classical belief that “seeing is believing”. Amongst these, architectures utilizing adversarial training, such as Generative Adversarial Networks (GANs) and Adversarial Autoencoders (AAEs), have received particular attention as self-supervised approaches to creating realistic outputs capable of fooling other neural networks during classification.

Intuitively, we can think of adversarial architectures as a policeman and a counterfeiter working in tandem: the counterfeiter works to produce realistic counterfeit currency, which is examined by the policeman, whom we assume is knowledgeable about the characteristics of real currency through experience. If the notes don’t pass as authentic currency, the policeman rejects them and instructs the counterfeiter on how to improve next time.

Not what we’ll be doing, thankfully

As time passes, the counterfeiter’s skills improve, and eventually he produces authentic-looking currency. No worries, however: we won’t be faking money today!

In this tutorial, we will utilize an Adversarial Autoencoder and the open-source KMNIST dataset to bring back a now essentially extinct form of ancient cursive Japanese script known as Kuzushiji. KMNIST (Kuzushiji-MNIST) is a drop-in replacement for the MNIST dataset (28x28 grayscale, 70,000 images), provided in the original MNIST format as well as a NumPy format, with one class chosen to represent each of the 10 rows of Kuzushiji Hiragana.

Examples of each of the 10 classes of Hiragana in KMNIST

Cursive Kuzushiji script had been in use for over 1,000 years before the 19th century, and included several different styles and formats for each character. However, during the Meiji period, Japan reformed its official language and writing system, standardizing it into a form similar to that used today. This caused the cursive script to fade from use, and today millions of documents on Japanese culture and history cannot be read by most modern scholars. Using our approach, we will breathe life into this extinct script, learning from the work of thousands of poets and writers to create new examples of Kuzushiji script.

For the course of this tutorial, we assume that the reader is familiar with the elements of deep learning, particularly backpropagation and the structure of simple one-dimensional (densely connected) neural networks.

Theory

An autoencoder is a self-supervised neural network designed to reconstruct its input from a lower-dimensional representation. A vanilla autoencoder consists of two components: an encoder and a decoder. The encoder takes an input and generates a lower-dimensional intermediate representation, known as the latent representation. For instance, an input image of a cat may become a smaller, lower-resolution representation that also occupies a proportionally smaller amount of memory.

The decoder’s role is to take this latent representation and restore it to its original, higher-dimensional form. It’s important to note that the decoder does not know what the original image looks like, and hence initially does a terrible job at reconstruction. We train the network by minimizing the reconstruction loss (here, the mean squared error), which measures the difference between the original input and the reconstructed input.

\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2

Definition of Mean Squared Error over N samples, where y_i is the original input and ŷ_i the reconstruction.
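As a quick illustration (not part of the tutorial code below), the reconstruction error between a batch of images and their reconstructions can be computed in a couple of lines of NumPy; the arrays here are random stand-ins for real data:

import numpy as np

original = np.random.rand(16, 28, 28)                          # a batch of 16 "input" images
reconstructed = original + 0.05 * np.random.randn(16, 28, 28)  # imperfect reconstructions

# Mean squared error over every pixel of every sample
mse = np.mean((original - reconstructed) ** 2)
print(mse)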

With each epoch, the encoder learns to generate a more meaningful latent representation, while the decoder learns to better convert said latent representation back into an approximation of the input image. This continues until the reconstruction loss is minimized and the input can be replicated. The uses of latent representations also extend beyond computational efficiency. By default, vanilla autoencoders don’t force the encoder’s output to match a specific distribution, but rather learn whatever relationships in the lower-dimensional latent space aid reconstruction. In this manner, an autoencoder can be used, for example, to remove ambient noise from an input image.
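To make this concrete, here is a minimal sketch of a vanilla autoencoder in Keras, in the same dense-layer style as the models we build later. The layer sizes are arbitrary illustrative choices, not the ones used in this tutorial:

from keras.layers import Input, Dense, Flatten, Reshape
from keras.models import Model

inp = Input(shape=(28, 28, 1))
h = Flatten()(inp)
h = Dense(128, activation='relu')(h)        # encoder
latent = Dense(10, activation='relu')(h)    # latent representation
h = Dense(128, activation='relu')(latent)   # decoder
out = Dense(28 * 28, activation='sigmoid')(h)
out = Reshape((28, 28, 1))(out)

vanilla_autoencoder = Model(inp, out)
# Reconstruction loss: mean squared error between input and output
vanilla_autoencoder.compile(optimizer='adam', loss='mse')
# The model is trained to reproduce its own input:
# vanilla_autoencoder.fit(x_train, x_train, epochs=10, batch_size=32)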

An adversarial autoencoder (AAE) shares its basic structure with a vanilla autoencoder, but also contains an adversarial component known as a discriminator, similar to that found in Generative Adversarial Networks. We can regard the autoencoder in an AAE as a generator: the encoder learns to shape the posterior distribution of latent representations to match an imposed prior distribution, while the decoder aims to reconstruct the image to minimize the reconstruction loss. The generator’s overall role is to produce latent representations that can fool the discriminator into believing they come from the true prior distribution rather than from the encoder’s posterior.

There are two alternating phases to training an AAE. Firstly, during the reconstruction phase, we only train the autoencoder (which consists of the encoder and decoder) to minimize the reconstruction error, in the same manner as we observed with the vanilla autoencoders.

Secondly, during the regularization phase, we train the discriminator to tell samples drawn from the true prior distribution apart from the generated samples (the encoder output). Following this, we then train the generator (which is the encoder of the autoencoder) to confuse the discriminator by better matching the aforementioned distributions.

After training is done, the decoder of the autoencoder acts as a generative model, mapping samples drawn from the imposed prior back to the data distribution with minimal reconstruction loss.

Sounds complex, right? To better understand the mechanics behind this, let’s go over a single pass through the network. Let’s say that we are feeding our autoencoder the KMNIST dataset, and we first pass it a batch of images representing the class “0”.

Each image, represented as an input tensor, is encoded into a latent representation, which is then decoded by the autoencoder. After this, the reconstruction error is calculated and backpropagated to update the autoencoder’s weights and minimize the aforementioned error.

The discriminator then judges whether a latent representation belongs to the true prior distribution or not, outputting a 1 if it believes the sample is real (drawn from the prior), or a 0 if it believes the sample is fake (produced by the encoder). From these results a discriminative adversarial loss, characterized by a binary cross-entropy error, is generated and backpropagated in order to update the discriminator’s weights and enhance its classification capabilities. Intuitively, we punish the discriminator should it mistake an encoder output for a true prior sample, or vice versa.
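For reference, the binary cross-entropy loss for a single prediction ŷ against a label y ∈ {0, 1} is

\mathrm{BCE}(y, \hat{y}) = -\left[\, y \log \hat{y} + (1 - y)\log(1 - \hat{y}) \,\right]

which is minimized when the discriminator’s output matches the label.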

We train the generator’s encoder component by keeping the weights of the discriminator fixed and setting the discriminator’s target to 1, so that the encoder learns the required distribution from the gradients flowing back through the frozen discriminator. After each generator training pass, we then fix the weights of the encoder and train the discriminator. This coupling continues to alternate for the number of epochs that we have specified.
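Put together, one training iteration looks roughly like the sketch below. The names (encoder, discriminator, adversarial_autoencoder) correspond to the models we define in the Implementation section; sample_batch is a hypothetical helper that draws a random batch of images:

# Schematic sketch of one alternating training iteration
imgs = sample_batch(X_train, batch_size)   # hypothetical helper: draw a random batch of real images

# Generator phase: minimize reconstruction error and try to make the
# discriminator label the encoded representations as "real" (1)
g_loss = adversarial_autoencoder.train_on_batch(imgs, [imgs, np.ones((batch_size, 1))])

# Discriminator phase: samples from the imposed prior are labelled 1,
# encoder outputs are labelled 0
latent_real = np.random.normal(size=(batch_size, latent_dim))
latent_fake = encoder.predict(imgs)
d_loss_real = discriminator.train_on_batch(latent_real, np.ones((batch_size, 1)))
d_loss_fake = discriminator.train_on_batch(latent_fake, np.zeros((batch_size, 1)))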

Implementation

Our code is based on Erik Linder-Norén’s Keras implementation in Python. As previously mentioned, we will be using the KMNIST dataset for our generative example, so let’s load up the dataset to begin. As the images are compressed into .npz files, we will use NumPy to load them into data arrays.

from keras.datasets import mnist
from keras.layers import Input, Dense, Reshape, Flatten, Lambda
from keras.layers.advanced_activations import LeakyReLU
from keras.models import Sequential, Model
from keras.optimizers import Adam
import keras.backend as K
import matplotlib.pyplot as plt
import numpy as np
import os
from PIL import Image

#For this project, we will only be using train_images
#To further improve the accuracy of the GAN, you could involve labels
PATH="../input/"
train_images = np.load(PATH+'kmnist-train-imgs.npz')['arr_0']
test_images = np.load(PATH+'kmnist-test-imgs.npz')['arr_0']
train_labels = np.load(PATH+'kmnist-train-labels.npz')['arr_0']
test_labels = np.load(PATH+'kmnist-test-labels.npz')['arr_0']

Let’s define some parameters here, specifying the size and color palette (greyscale) of our images, along with the input data batch size. The latent_dim parameter represents the reduced dimensionality of the latent representation. Finally, let’s plot a few of our images to see what kind of dataset we are dealing with. We’ll also define a sampling function that draws a latent vector from the Gaussian distribution defined by the encoder’s mean and log-variance outputs.

img_rows = 28
img_cols = 28
channels = 1
img_shape = (img_rows, img_cols, channels)
latent_dim = 10 #10 classes and hence 10 dimensions
batch_size = 16
epsilon_std = 1.0

# View the dataset to get an idea of what we're dealing with
def plot_sample_images_data(images, labels):
    plt.figure(figsize=(12, 12))
    for i in range(10):
        imgs = images[np.where(labels == i)]
        lbls = labels[np.where(labels == i)]
        for j in range(10):
            plt.subplot(10, 10, i * 10 + j + 1)
            plt.xticks([])
            plt.yticks([])
            plt.grid(False)
            plt.imshow(imgs[j], cmap=plt.cm.binary)
            plt.xlabel(lbls[j])

plot_sample_images_data(train_images, train_labels)

def sampling(args):
    z_mean, z_log_var = args
    epsilon = K.random_normal(shape=(batch_size, latent_dim), mean=0., stddev=epsilon_std)
    return z_mean + K.exp(z_log_var / 2) * epsilon

Examples of input distribution
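The sampling function above implements the reparameterization trick: given the encoder’s mean and log-variance outputs, a latent vector is drawn as

z = \mu + \exp\!\left(\tfrac{1}{2}\log\sigma^2\right)\cdot\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)

which keeps the sampling step differentiable with respect to the encoder’s outputs.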

Now, it’s time to define the architecture of our encoder, decoder, and discriminator. All three consist of densely connected layers with LeakyReLU activations. Notice that the discriminator operates on encoded (latent) representations, not decoded images.

def build_encoder():
    img = Input(shape=img_shape)
    h = Flatten()(img)
    h = Dense(512)(h)
    h = LeakyReLU(alpha=0.2)(h)
    h = Dense(512)(h)
    h = LeakyReLU(alpha=0.2)(h)
    mu = Dense(latent_dim)(h)
    log_var = Dense(latent_dim)(h)
    z = Lambda(sampling, output_shape=(latent_dim,), name='z')([mu, log_var])
    return Model(img, z)

def build_decoder():
    model = Sequential()
    model.add(Dense(512, input_dim=latent_dim))
    model.add(LeakyReLU(alpha=0.2))
    model.add(Dense(512))
    model.add(LeakyReLU(alpha=0.2))
    # tanh is more robust: gradient not equal to 0 around 0
    model.add(Dense(np.prod(img_shape), activation='tanh'))
    model.add(Reshape(img_shape))
    model.summary()
    z = Input(shape=(latent_dim,))
    img = model(z)
    return Model(z, img)

def build_discriminator():
    model = Sequential()
    model.add(Dense(1024, input_dim=latent_dim))
    model.add(LeakyReLU(alpha=0.2))
    model.add(Dense(512))
    model.add(LeakyReLU(alpha=0.2))
    model.add(Dense(256))
    model.add(LeakyReLU(alpha=0.2))
    model.add(Dense(1, activation="sigmoid"))
    model.summary()
    encoded_repr = Input(shape=(latent_dim,))
    validity = model(encoded_repr)
    return Model(encoded_repr, validity)

Next, we build all of our components. We will use the Adam optimizer to update the weights of our networks, with a learning rate of 0.0002. For our initial pass, we train the generator against the randomly initialized discriminator, so we fix the latter’s weights within the combined model to facilitate this.

optimizer = Adam(0.0002, 0.5)

# Build and compile the discriminator, which operates on latent representations
discriminator = build_discriminator()
discriminator.compile(loss='binary_crossentropy',
                      optimizer=optimizer,
                      metrics=['accuracy'])

# Build the autoencoder (generator) components
encoder = build_encoder()
decoder = build_decoder()

img = Input(shape=img_shape)
encoded_repr = encoder(img)
reconstructed_img = decoder(encoded_repr)

# Within the combined model, only the generator is trained on the first pass
discriminator.trainable = False

# The discriminator judges the validity of the encoded representation
validity = discriminator(encoded_repr)

# The combined model is trained on both the reconstruction and adversarial losses
adversarial_autoencoder = Model(img, [reconstructed_img, validity])
adversarial_autoencoder.compile(loss=['mse', 'binary_crossentropy'],
                                loss_weights=[0.999, 0.001],
                                optimizer=optimizer)
adversarial_autoencoder.trainable = True

Finally, let’s define the training function for our adversarial autoencoder. Note how we load and normalize our dataset (to the range [-1, 1], matching the decoder’s tanh output) within the training function.

As previously mentioned, on our first pass we only train the generator component, adversarial_autoencoder, aiming to minimize the reconstruction error and generate a passable output for the discriminator (g_loss). Afterwards, we alternate to training the discriminator to correctly distinguish between real and generated latent representations (d_loss). The actual training is handled by Keras’s built-in train_on_batch() method.

def train(epochs, batch_size=128, sample_interval=50):
    # Load the dataset
    X_train = train_images

    # Normalization: rescale from [0, 255] to [-1, 1]
    X_train = (X_train.astype(np.float32) - 127.5) / 127.5
    X_train = np.expand_dims(X_train, axis=3)

    # Adversarial ground truths
    valid = np.ones((batch_size, 1))
    fake = np.zeros((batch_size, 1))

    for epoch in range(epochs):
        # Train Discriminator and Generator

        # Select a random batch of images
        idx = np.random.randint(0, X_train.shape[0], batch_size)
        imgs = X_train[idx]

        latent_fake = encoder.predict(imgs)
        latent_real = np.random.normal(size=(batch_size, latent_dim))

        d_loss_real = discriminator.train_on_batch(latent_real, valid)
        d_loss_fake = discriminator.train_on_batch(latent_fake, fake)
        d_loss = 0.5 * np.add(d_loss_real, d_loss_fake)  # average the real and fake losses

        g_loss = adversarial_autoencoder.train_on_batch(imgs, [imgs, valid])

        # Plot the progress
        if epoch % sample_interval == 0:
            print("%d [D loss: %f, acc: %.2f%%] [G loss: %f, mse: %f]" % (
                epoch, d_loss[0], 100 * d_loss[1], g_loss[0], g_loss[1]))
            sample_images(epoch)

        # Now that the initial training epoch has passed, we switch trainable roles
        if discriminator.trainable == False:
            discriminator.trainable = True
            adversarial_autoencoder.trainable = False
        elif discriminator.trainable == True:
            discriminator.trainable = False
            adversarial_autoencoder.trainable = True

With everything finished, let’s write a quick decoding routine to save our generated outputs, and run our model!

# Save generated images per specified epochs 
def sample_images(epoch):
    r, c = 5, 5
    z = np.random.normal(size=(r * c, latent_dim))
    gen_imgs = decoder.predict(z)
    gen_imgs = 0.5 * gen_imgs + 0.5
    fig, axs = plt.subplots(r, c)
    cnt = 0
    for i in range(r):
        for j in range(c):
            axs[i, j].imshow(gen_imgs[cnt, :, :, 0], cmap=plt.cm.binary)
            axs[i, j].axis('off')
            cnt += 1
    fig.savefig("mnist_%d.png" % epoch)
    plt.close()

epochs = 60000
sample_interval = 2000
sample_count = epochs / sample_interval
train(epochs=epochs, batch_size=batch_size, sample_interval=sample_interval)

Outputs

Let’s visualize the progress of our AAE across 60,000 epochs. The training should take around 30 minutes when run on a GPU-enabled Kaggle instance.

Epoch 6000
Epoch 28000
Epoch 58000

The character outputs become significantly clearer with more epochs, particularly the more detailed, complex characters. Recall that we are combining the styles of thousands of references in our generated outputs. Given the low resolution of our input images and the relatively short training time, this is a more than acceptable result.
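If you are running this in a notebook, the saved grids can be reloaded and displayed with a few lines (assuming the mnist_<epoch>.png files written by sample_images() above are in the working directory):

# Display one of the saved sample grids, e.g. the final one
img = Image.open("mnist_58000.png")
plt.figure(figsize=(6, 6))
plt.imshow(img)
plt.axis('off')
plt.show()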

So there you have it: we’ve breathed life back into a near-extinct script using a very simple adversarial autoencoder network. We encourage you to play around with the weights and optimizers, and further improve upon our initial results.

Thank you to Arushi Goel for her valuable input.

References

Goel, Arushi.

Linder-Norén, Erik. GAN implementations in Keras.

Hubens, Nathan. Deep Inside: Autoencoders.

Nagabushan, Naresh. A Wizard’s Guide to Adversarial Autoencoders.
