ThisPresidentDoesNotExist: Generating Artistic Presidential Portraits using Style-based Adversarial Networks and Transfer Learning: Theory and Implementation in TensorFlow

Introduction

Generative Adversarial Networks, or GANs, are a branch of deep learning that has gained public attention in the arts for its ability to generate new content in the style of existing data. They do so through an adversarial architecture in which a generator component tries to fool a discriminator component trained on real-world data. More recently, NVIDIA’s StyleGAN made headlines worldwide for its impressive ability to generate realistic facsimiles of existing images at extremely high resolutions while maintaining fast training speeds. Following its initial publication in December 2018, the architecture has been used to create a series of websites termed “This [XXX] does not exist”, where [XXX] has represented people, cats, celebrities, or even Airbnb listings.

In this tutorial, we will utilize the “Trump Photos” dataset, available on Kaggle, together with a StyleGAN architecture to generate new representations of the 45th president of the United States, or in essence, access a multiverse of Trumps. The criteria behind making Donald Trump the subject of our study were purely non-political, and revolved around him being a universally relevant, recognized, and discussed persona in society today. To generate artistic outputs and to save on training time, we will utilize a pre-trained model from the Stylegan-art repository.

Access the multiverse!

We assume that the reader is familiar with deep learning, neural networks, and the fundamentals of adversarial architectures. For those wishing a quick review, we’ve covered the elements behind GANs in our previous publication on Adversarial Autoencoders, which provides a gentle introduction to the field. Literature on the topic is also plentiful on Medium.

Theory

A complete explanation of the StyleGAN architecture would go well beyond the length of a single tutorial, so we will instead briefly discuss the improvements the NVIDIA team has made with this new approach over traditional adversarial architectures. A complete end-to-end description of the architecture can be found in the original publications on ProGAN and StyleGAN. Let’s look at the overall architecture of a StyleGAN (below):

Traditional vs. StyleGAN architectures (Karras)

Unlike conventional GANs, which are limited to generating low-resolution images to keep training fast, these newer GAN architectures train in phases across increasing resolutions, starting from a low-resolution image of 8x8 for StyleGAN and 4x4 for ProGAN. Different feature details become prominent at each resolution (for example, while one’s skin tone may be discernible at low resolutions, hair textures and wrinkle details are only visible at much higher resolutions), so this approach helps the network capture and generate more realistic and varied images while reducing overall training time. StyleGAN is a derivative of the older ProGAN architecture (also by Karras et al.), sharing the same progressive training approach and the same discriminator architecture. ProGAN architectures have been shown to generate high-quality images, but their ability to control local variation at each resolution is limited by design. Put simply, changing the input even by minute amounts introduces variation to multiple features across different levels of detail, a phenomenon known as feature entanglement.

StyleGAN architectures use a novel method to localize feature variation: while ProGAN injects noise only at the beginning of the upscaling process to ensure variation in the output features, StyleGAN injects noise at each resolution step of the upscaling process, introducing randomness at all levels of detail. This allows for increased variation at all resolution levels of the generated output, reducing the risk of mode collapse. In addition, StyleGAN does not feed its latent input directly to the generator. Instead, a separate densely-connected mapping network transforms the latent representation into an intermediate one that is fed into the generator at each resolution level. This approach has been shown to improve variation in generated images by reducing feature entanglement.
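As a rough illustration of these two ideas, the NumPy sketch below maps a latent vector through a small dense network and adds fresh noise to a feature map at every resolution. The layer count, sizes, random weights, and the names mapping_network and add_noise are our own illustrative assumptions, not NVIDIA's implementation, which learns its weights and noise scaling during training.

```python
import numpy as np

rng = np.random.default_rng(0)

def mapping_network(z, weights, alpha=0.2):
    """Map a latent vector z to an intermediate latent w with dense layers."""
    w = z
    for layer in weights:
        h = w @ layer
        w = np.where(h > 0, h, alpha * h)  # leaky ReLU activation
    return w

def add_noise(features, strength, rng):
    """Add per-pixel Gaussian noise, shared across channels, to a CHW feature map."""
    _, height, width = features.shape
    noise = rng.standard_normal((1, height, width))
    return features + strength * noise

# Hypothetical toy sizes: an 8-dimensional latent and 4 dense layers.
latent_size = 8
weights = [rng.standard_normal((latent_size, latent_size)) * 0.1 for _ in range(4)]
z = rng.standard_normal(latent_size)
w = mapping_network(z, weights)

# Fresh noise is injected at every resolution, not only at the input.
outputs = {}
for resolution in (4, 8, 16):
    features = np.zeros((3, resolution, resolution))
    outputs[resolution] = add_noise(features, strength=0.1, rng=rng)
```

Because each resolution receives its own noise, random detail can vary at one level of detail without disturbing the others, which is the intuition behind the reduced entanglement described above.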

Now that we’ve briefly discussed our architectures, let’s move on to our implementation.

Implementation

Our code is based on the open-source Stylegan-art project, which aims to generate stylistic portraits and in turn relies on the original TensorFlow implementation by the NVIDIA team. All of our code is in Python and was run on Google Colab, which provides free GPU access for researchers (this turns out to be a double-edged sword, but more on that below!). All of it can be viewed on GitHub.

In the interest of time, we’ve decided to focus on the training section of the implementation, written in TensorFlow. You’ll find that despite the differences in syntax, the essential principles we’ve covered previously in Keras remain the same.

To summarize, our implementation is structured as follows:

  • Clone the NVIDIA StyleGAN Git repository and download a StyleGAN network pre-trained on artistic portrait data.
  • Download and normalize all of the images of the Donald Trump Kaggle dataset. We’ve used the gdown package to allow our script to download a copy at runtime, but users with local hardware setups may find it easier to pre-download a local copy.
  • Run facial detection to ensure that our dataset contains clearly visible faces. To do this, we use the autocrop package, which filters the dataset to separate examples with clearly visible faces from those without.
  • Train the pre-trained network on our facial data using Karras’s script. This is known as transfer learning.
  • Generate new instances of Donald Trump at regular training intervals (ticks).
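The face-filtering step can be sketched in a few lines of plain Python. Note that crop_face below is a hypothetical stand-in for whatever detector you plug in (in our notebook that role is played by the autocrop package). It is assumed to return a cropped face, or None when no face is found. The toy detector and the file names are made up purely for demonstration.

```python
def keep_images_with_faces(paths, crop_face):
    """Split paths into (kept, rejected) using a face-cropping callable.

    crop_face is a hypothetical stand-in: it takes a path and returns the
    cropped face, or None when no face is found.
    """
    kept, rejected = [], []
    for path in paths:
        face = crop_face(path)
        if face is None:
            rejected.append(path)
        else:
            kept.append((path, face))
    return kept, rejected

# Toy detector for demonstration only: pretends files named "face_*" contain a face.
def fake_crop_face(path):
    return "cropped:" + path if path.startswith("face_") else None

kept, rejected = keep_images_with_faces(
    ["face_001.jpg", "crowd_002.jpg", "face_003.jpg"], fake_crop_face)
print(len(kept), "kept;", len(rejected), "rejected")
```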

Transfer learning is a technique that drastically speeds up training by starting from a network previously trained on another dataset, leaving only certain higher-level details to be learned from the new dataset. It assumes that similar datasets share many of their features. As our pre-trained network was trained on an extensive artistic portrait dataset containing a broad range of human faces, we simply re-train the network on our Trump dataset at higher resolutions to capture details specific to the 45th President of the United States.

You’ll find the main training initialization script at the bottom of the notebook. It launches the training loop defined in training_loop.py and supplies the arguments for the generator and discriminator, along with their training options and loss functions. Note that the same script is capable of switching between StyleGAN and ProGAN approaches, so be careful when modifying the code!

def main():
    kwargs = EasyDict(train)
    kwargs.update(G_args=G, D_args=D, G_opt_args=G_opt, D_opt_args=D_opt, G_loss_args=G_loss, D_loss_args=D_loss)
    kwargs.update(dataset_args=dataset, sched_args=sched, grid_args=grid, tf_config=tf_config)
    kwargs.submit_config = copy.deepcopy(submit_config)
    kwargs.submit_config.run_dir_root = dnnlib.submission.submit.get_template_from_path(config.result_dir)
    kwargs.submit_config.run_dir_ignore += config.run_dir_ignore
    kwargs.submit_config.run_desc = desc
    dnnlib.submit_run(**kwargs)

The actual training is done within the training_loop.py file. To force transfer learning from the pre-trained network on our dataset, we modify and save this file before running the initialization script. Let’s go over its details!

def process_reals(x, lod, mirror_augment, drange_data, drange_net):
    with tf.name_scope('ProcessReals'):
        with tf.name_scope('DynamicRange'):
            x = tf.cast(x, tf.float32)
            x = misc.adjust_dynamic_range(x, drange_data, drange_net)
        if mirror_augment:
            with tf.name_scope('MirrorAugment'):
                s = tf.shape(x)
                mask = tf.random_uniform([s[0], 1, 1, 1], 0.0, 1.0)
                mask = tf.tile(mask, [1, s[1], s[2], s[3]])
                x = tf.where(mask < 0.5, x, tf.reverse(x, axis=[3]))
        with tf.name_scope('FadeLOD'): # Smooth crossfade between consecutive levels-of-detail.
            s = tf.shape(x)
            y = tf.reshape(x, [-1, s[1], s[2]//2, 2, s[3]//2, 2])
            y = tf.reduce_mean(y, axis=[3, 5], keepdims=True)
            y = tf.tile(y, [1, 1, 1, 2, 1, 2])
            y = tf.reshape(y, [-1, s[1], s[2], s[3]])
            x = tflib.lerp(x, y, lod - tf.floor(lod))
        with tf.name_scope('UpscaleLOD'): # Upscale to match the expected input/output size of the networks.
            s = tf.shape(x)
            factor = tf.cast(2 ** tf.floor(lod), tf.int32)
            x = tf.reshape(x, [-1, s[1], s[2], 1, s[3], 1])
            x = tf.tile(x, [1, 1, 1, factor, 1, factor])
            x = tf.reshape(x, [-1, s[1], s[2] * factor, s[3] * factor])
        return x

The first thing you’ll notice is the auxiliary method process_reals(), which provides just-in-time (JIT) processing of real images before they are fed to the network. This method handles dynamic-range adjustment, mirror-flip augmentation, detail crossfading, and detail upscaling, all of which improve the robustness of our model. This is especially true when using constrained datasets. In addition, JIT processing drastically reduces the computing power and memory needed for training.
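To make the four steps easier to follow, here is a NumPy re-creation that can be run on a tiny array without TensorFlow. It is a sketch under our own assumptions: images are laid out as (batch, channels, height, width), the helper names are ours, and the flip decision is passed in explicitly rather than drawn at random.

```python
import numpy as np

def adjust_dynamic_range(x, drange_in, drange_out):
    """Linearly rescale x from drange_in=(lo, hi) to drange_out=(lo, hi)."""
    scale = (drange_out[1] - drange_out[0]) / (drange_in[1] - drange_in[0])
    bias = drange_out[0] - drange_in[0] * scale
    return x * scale + bias

def mirror(x, flip):
    """Flip the images selected by the boolean vector flip along the width axis."""
    return np.where(flip[:, None, None, None], x[:, :, :, ::-1], x)

def crossfade(x, lod):
    """Blend x with its 2x2 box-filtered version by the fractional part of lod."""
    n, c, h, w = x.shape
    y = x.reshape(n, c, h // 2, 2, w // 2, 2).mean(axis=(3, 5), keepdims=True)
    y = np.tile(y, (1, 1, 1, 2, 1, 2)).reshape(n, c, h, w)
    t = lod - np.floor(lod)
    return x + (y - x) * t

def upscale(x, lod):
    """Repeat each pixel 2**floor(lod) times along height and width."""
    n, c, h, w = x.shape
    factor = int(2 ** np.floor(lod))
    x = np.tile(x.reshape(n, c, h, 1, w, 1), (1, 1, 1, factor, 1, factor))
    return x.reshape(n, c, h * factor, w * factor)

# A single 4x4 one-channel "image" with pixel values 0..15.
images = np.arange(16, dtype=np.float32).reshape(1, 1, 4, 4)
scaled = adjust_dynamic_range(images, (0, 255), (-1, 1))
flipped = mirror(images, np.array([True]))
faded = crossfade(images, lod=0.5)
enlarged = upscale(images, lod=1.0)
```

With lod=0.5 each pixel lands halfway between its original value and the mean of its 2x2 block, which is the smooth fade between two consecutive levels of detail.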

Next you’ll see the training_schedule() method.

def training_schedule(
    cur_nimg,
    training_set,
    num_gpus,
    lod_initial_resolution = 4, # Image resolution used at the beginning.
    lod_training_kimg = 600, # Thousands of real images to show before doubling the resolution.
    lod_transition_kimg = 600, # Thousands of real images to show when fading in new layers.
    minibatch_base = 16, # Maximum minibatch size, divided evenly among GPUs.
    minibatch_dict = {}, # Resolution-specific overrides.
    max_minibatch_per_gpu = {}, # Resolution-specific maximum minibatch size per GPU.
    G_lrate_base = 0.001, # Learning rate for the generator.
    G_lrate_dict = {}, # Resolution-specific overrides.
    D_lrate_base = 0.001, # Learning rate for the discriminator.
    D_lrate_dict = {}, # Resolution-specific overrides.
    lrate_rampup_kimg = 0, # Duration of learning rate ramp-up.
    tick_kimg_base = 160, # Default interval of progress snapshots.
    tick_kimg_dict = {4: 160, 8:140, 16:120, 32:100, 64:80, 128:60, 256:40, 512:30, 1024:20}): # Resolution-specific overrides.

    # Initialize result dict.
    s = dnnlib.EasyDict()
    s.kimg = cur_nimg / 1000.0

    # Training phase.
    phase_dur = lod_training_kimg + lod_transition_kimg
    phase_idx = int(np.floor(s.kimg / phase_dur)) if phase_dur > 0 else 0
    phase_kimg = s.kimg - phase_idx * phase_dur

    # Level-of-detail and resolution.
    s.lod = training_set.resolution_log2
    s.lod -= np.floor(np.log2(lod_initial_resolution))
    s.lod -= phase_idx
    if lod_transition_kimg > 0:
        s.lod -= max(phase_kimg - lod_training_kimg, 0.0) / lod_transition_kimg
    s.lod = max(s.lod, 0.0)
    s.resolution = 2 ** (training_set.resolution_log2 - int(np.floor(s.lod)))

    # Minibatch size.
    s.minibatch = minibatch_dict.get(s.resolution, minibatch_base)
    s.minibatch -= s.minibatch % num_gpus
    if s.resolution in max_minibatch_per_gpu:
        s.minibatch = min(s.minibatch, max_minibatch_per_gpu[s.resolution] * num_gpus)

    # Learning rate.
    s.G_lrate = G_lrate_dict.get(s.resolution, G_lrate_base)
    s.D_lrate = D_lrate_dict.get(s.resolution, D_lrate_base)
    if lrate_rampup_kimg > 0:
        rampup = min(s.kimg / lrate_rampup_kimg, 1.0)
        s.G_lrate *= rampup
        s.D_lrate *= rampup

    # Other parameters.
    s.tick_kimg = tick_kimg_dict.get(s.resolution, tick_kimg_base)
    return s

This defines many of the parameters needed when training a model from scratch, including the initial resolution (4x4 here), the level of detail, the number of images shown before the training resolution doubles (600,000 here), minibatch-size constraints, learning rates, and snapshot intervals. As we are using transfer learning on a pre-trained network (and hence training at high resolutions), there’s no need to modify any of these parameters now.
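To see how the level of detail and resolution evolve, the level-of-detail part of training_schedule() can be replayed in plain Python. This is only a sketch: the function name is ours, and resolution_log2=10 (a 1024x1024 dataset) is an assumption made for the example.

```python
import math

def lod_and_resolution(kimg, resolution_log2=10, lod_initial_resolution=4,
                       lod_training_kimg=600, lod_transition_kimg=600):
    """Return (lod, resolution) after kimg thousand images, mirroring training_schedule()."""
    phase_dur = lod_training_kimg + lod_transition_kimg
    phase_idx = int(math.floor(kimg / phase_dur)) if phase_dur > 0 else 0
    phase_kimg = kimg - phase_idx * phase_dur
    lod = resolution_log2 - math.floor(math.log2(lod_initial_resolution)) - phase_idx
    if lod_transition_kimg > 0:
        # During the transition part of a phase, lod decreases smoothly.
        lod -= max(phase_kimg - lod_training_kimg, 0.0) / lod_transition_kimg
    lod = max(lod, 0.0)
    resolution = 2 ** (resolution_log2 - int(math.floor(lod)))
    return lod, resolution

for kimg in (0, 900, 1200, 9600):
    print(kimg, lod_and_resolution(kimg))
```

Under these assumptions, training starts at 4x4, is halfway through fading in 8x8 after 900 thousand images, and only reaches the full 1024x1024 resolution with no fade remaining at 9,600 thousand images.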

Finally, we come to the training_loop() method, responsible for the preparation, construction, and training of our networks. You’ll see multiple variables initialized and parameters defined here, but there are three we must define with transfer learning in mind:

  • The network_snapshot_ticks parameter determines how often network pickle files are saved. As each of these is around 300 MB, snapshots should be saved sparingly.
  • The image_snapshot_ticks parameter determines how often sample image grids are generated for inspection and progress tracking.
  • The resume_kimg parameter sets the assumed training progress, in thousands of images, at the start of the run. It must be set to a high value to ensure that training commences at high resolutions on our Trump dataset.
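How high is “high” for resume_kimg? Assuming the phase lengths shown in training_schedule() and a dataset whose full resolution is the target, a back-of-the-envelope helper (the name min_resume_kimg is ours, not part of the StyleGAN code) gives the smallest value at which the full resolution is already completely faded in:

```python
import math

def min_resume_kimg(target_resolution, lod_initial_resolution=4,
                    lod_training_kimg=600, lod_transition_kimg=600):
    """Smallest kimg at which target_resolution is fully faded in (lod == 0).

    Assumes target_resolution is the dataset's full resolution and that both
    resolutions are powers of two.
    """
    phases = int(math.log2(target_resolution)) - int(math.log2(lod_initial_resolution))
    return phases * (lod_training_kimg + lod_transition_kimg)

print(min_resume_kimg(1024))                            # starting from 4x4
print(min_resume_kimg(1024, lod_initial_resolution=8))  # starting from 8x8
```

Any resume_kimg at or above this value keeps the schedule at the final resolution. The right number for your own run depends on the schedule arguments you actually pass in.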

After loading our datasets, the method builds our network components in accordance with the arguments supplied in the initialization script.

It’s worth mentioning that if a network snapshot is available, the network components can be loaded from it and training resumed from that point. This is a necessity for transfer learning, hence we’ve hard-coded the system to use the pre-trained network, defined by “ /network-snapshot-011155.pkl”.

# Load training set.
training_set = dataset.load_dataset(data_dir=config.data_dir, verbose=True, **dataset_args)

# Construct networks.
with tf.device('/gpu:0'):
    if resume_run_id is not None:
        network_pkl = misc.locate_network_pkl(resume_run_id, resume_snapshot)
        print('Loading networks from "%s"...' % network_pkl)
        G, D, Gs = misc.load_pkl(network_pkl)
    else:
        print('Constructing networks...')
        G = tflib.Network('G', num_channels=training_set.shape[0], resolution=training_set.shape[1], label_size=training_set.label_size, **G_args)
        D = tflib.Network('D', num_channels=training_set.shape[0], resolution=training_set.shape[1], label_size=training_set.label_size, **D_args)
        Gs = G.clone('Gs')
G.print_layers(); D.print_layers()

Next, we build the TensorFlow graph covering the entire architecture. Think of it as a map showing the relationships between your variables and your operations. Graphs allow you to run separate sections of your computations independently and on different devices, such as the CPU and one or more GPUs.

print('Building TensorFlow graph...')
with tf.name_scope('Inputs'), tf.device('/cpu:0'):
    lod_in = tf.placeholder(tf.float32, name='lod_in', shape=[])
    lrate_in = tf.placeholder(tf.float32, name='lrate_in', shape=[])
    minibatch_in = tf.placeholder(tf.int32, name='minibatch_in', shape=[])
    minibatch_split = minibatch_in // submit_config.num_gpus
    Gs_beta = 0.5 ** tf.div(tf.cast(minibatch_in, tf.float32), G_smoothing_kimg * 1000.0) if G_smoothing_kimg > 0.0 else 0.0
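The Gs_beta expression above controls how quickly Gs, a smoothed copy of the generator, follows G. The arithmetic is easy to check in plain Python. In this sketch the value of 10 for G_smoothing_kimg and the helper names are our assumptions, and moving_average() is a simplified scalar stand-in for the weight averaging that the library performs.

```python
def smoothing_beta(minibatch, g_smoothing_kimg):
    """Per-update decay factor derived from the smoothing half-life in kimg."""
    if g_smoothing_kimg <= 0.0:
        return 0.0
    return 0.5 ** (minibatch / (g_smoothing_kimg * 1000.0))

def moving_average(gs_weight, g_weight, beta):
    """One smoothing step: keep a fraction beta of the old averaged weight."""
    return g_weight + (gs_weight - g_weight) * beta

beta = smoothing_beta(minibatch=16, g_smoothing_kimg=10.0)

# Start with an averaged weight of 1.0 while the generator weight sits at 0.0.
gs = 1.0
for _ in range(625):  # 625 updates * 16 images = 10,000 images
    gs = moving_average(gs, 0.0, beta)
print(beta, gs)
```

After exactly G_smoothing_kimg thousand images the old value has lost half of its influence, so G_smoothing_kimg acts as a half-life measured in images rather than in update steps.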

We define the TensorFlow optimizers for both the generator and discriminator components using the arguments supplied in the initialization script. These are used for training via gradient descent. We iterate through the available GPUs, local or otherwise, building a copy of the networks on each one for maximum training efficiency. Once finished, we supply a minibatch of JIT-preprocessed data for the training process.

Next, we register the gradients of the loss functions with respect to our trainable variables (exposed by each network’s trainables attribute). During gradient descent (defined in G_train_op and D_train_op for the generator and discriminator components, respectively), the apply_updates() method uses these gradients to iteratively adjust the variables and minimize the losses, moving towards our ideal converged outputs.

G_opt = tflib.Optimizer(name='TrainG', learning_rate=lrate_in, **G_opt_args)
D_opt = tflib.Optimizer(name='TrainD', learning_rate=lrate_in, **D_opt_args)
for gpu in range(submit_config.num_gpus):
    with tf.name_scope('GPU%d' % gpu), tf.device('/gpu:%d' % gpu):
        G_gpu = G if gpu == 0 else G.clone(G.name + '_shadow')
        D_gpu = D if gpu == 0 else D.clone(D.name + '_shadow')
        lod_assign_ops = [tf.assign(G_gpu.find_var('lod'), lod_in), tf.assign(D_gpu.find_var('lod'), lod_in)]
        reals, labels = training_set.get_minibatch_tf()
        reals = process_reals(reals, lod_in, mirror_augment, training_set.dynamic_range, drange_net)
        with tf.name_scope('G_loss'), tf.control_dependencies(lod_assign_ops):
            G_loss = dnnlib.util.call_func_by_name(G=G_gpu, D=D_gpu, opt=G_opt, training_set=training_set, minibatch_size=minibatch_split, **G_loss_args)
        with tf.name_scope('D_loss'), tf.control_dependencies(lod_assign_ops):
            D_loss = dnnlib.util.call_func_by_name(G=G_gpu, D=D_gpu, opt=D_opt, training_set=training_set, minibatch_size=minibatch_split, reals=reals, labels=labels, **D_loss_args)
        G_opt.register_gradients(tf.reduce_mean(G_loss), G_gpu.trainables)
        D_opt.register_gradients(tf.reduce_mean(D_loss), D_gpu.trainables)
G_train_op = G_opt.apply_updates()
D_train_op = D_opt.apply_updates()

Gs_update_op = Gs.setup_as_moving_average_of(G, beta=Gs_beta)
with tf.device('/gpu:0'):
    try:
        peak_gpu_mem_op = tf.contrib.memory_stats.MaxBytesInUse()
    except tf.errors.NotFoundError:
        peak_gpu_mem_op = tf.constant(0)
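The register-then-apply pattern above can be mimicked with a toy example. The sketch below is not the tflib optimizer: it assumes per-GPU gradients are simply averaged, swaps in plain gradient descent for whatever optimizer G_opt_args and D_opt_args select, and uses a made-up quadratic loss. All function names are ours.

```python
def average_gradients(per_gpu_grads):
    """Average a list of per-GPU gradient lists element-wise."""
    num_gpus = len(per_gpu_grads)
    return [sum(grads) / num_gpus for grads in zip(*per_gpu_grads)]

def apply_update(variables, grads, learning_rate):
    """One plain gradient-descent step."""
    return [v - learning_rate * g for v, g in zip(variables, grads)]

def toy_gradients(variables, targets):
    """Gradient of mean((v - t)**2) over one GPU's share of the minibatch."""
    return [sum(2.0 * (v - t) for t in targets) / len(targets) for v in variables]

variables = [0.0]
shards = [[1.0, 2.0], [3.0, 4.0]]  # one shard of the minibatch per hypothetical GPU
for _ in range(200):
    grads = average_gradients([toy_gradients(variables, shard) for shard in shards])
    variables = apply_update(variables, grads, learning_rate=0.1)
print(variables)
```

Each hypothetical GPU only sees its own shard, yet the averaged update drives the variable towards the mean of the whole minibatch, which is the point of splitting the work this way.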

Finally, we set up some auxiliary functionality to export generated image snapshots, which give us an idea of training progress through a grid of generated images at regular tick intervals (controlled by the image_snapshot_ticks variable mentioned previously).

print('Setting up snapshot image grid...')
grid_size, grid_reals, grid_labels, grid_latents = misc.setup_snapshot_image_grid(G, training_set, **grid_args)
sched = training_schedule(cur_nimg=total_kimg*1000, training_set=training_set, num_gpus=submit_config.num_gpus, **sched_args)
grid_fakes = Gs.run(grid_latents, grid_labels, is_validation=True, minibatch_size=sched.minibatch//submit_config.num_gpus)

print('Setting up run dir...')
misc.save_image_grid(grid_reals, os.path.join(submit_config.run_dir, 'reals.png'), drange=training_set.dynamic_range, grid_size=grid_size)
misc.save_image_grid(grid_fakes, os.path.join(submit_config.run_dir, 'fakes%06d.png' % resume_kimg), drange=drange_net, grid_size=grid_size)
summary_log = tf.summary.FileWriter(submit_config.run_dir)
if save_tf_graph:
    summary_log.add_graph(tf.get_default_graph())
if save_weight_histograms:
    G.setup_weight_histograms(); D.setup_weight_histograms()
metrics = metric_base.MetricGroup(metric_arg_list)

With all of that finished, it’s finally time to begin training. Our network was trained on Google Colab, where each tick took an incredible 6 hours to train (and this only covered 4 minibatches of data!). This has been reported before (Gwern), and hence producing truly realistic images would require a dedicated GPU hardware setup. However, let’s look at the results after a single tick of 6 hours’ worth of training!

Training is done separately for the discriminator and the generator. On each pass of the outer loop, which runs minibatch_repeats times, the discriminator is updated D_repeats times and the generator once.

# Run training ops.
for _mb_repeat in range(minibatch_repeats):
    for _D_repeat in range(D_repeats):
        tflib.run([D_train_op, Gs_update_op], {lod_in: sched.lod, lrate_in: sched.D_lrate, minibatch_in: sched.minibatch})
        cur_nimg += sched.minibatch
    tflib.run([G_train_op], {lod_in: sched.lod, lrate_in: sched.G_lrate, minibatch_in: sched.minibatch})
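To see how the two repeat parameters interact, we can replay the nested loop with counters in place of the TensorFlow ops. The function name and the example values are our own and purely illustrative.

```python
def count_updates(minibatch_repeats, d_repeats, minibatch):
    """Replay the nested training loop and count updates and images shown."""
    d_updates = g_updates = images_seen = 0
    for _mb_repeat in range(minibatch_repeats):
        for _d_repeat in range(d_repeats):
            d_updates += 1            # discriminator step (run together with Gs averaging)
            images_seen += minibatch  # cur_nimg only advances on discriminator steps
        g_updates += 1                # one generator step per outer iteration
    return d_updates, g_updates, images_seen

print(count_updates(minibatch_repeats=4, d_repeats=2, minibatch=16))
```

With these example values the discriminator is updated eight times, the generator four times, and the image counter advances by 128.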

You’ll notice that we keep track of progress through regular tick-based reports, snapshots (snippet below), summaries, and checkpoint networks. Once the network has been trained for the desired duration, defined by the tick duration and total_kimg variables, we save our final network.

# Save snapshots.
if cur_tick % image_snapshot_ticks == 0 or done:
    grid_fakes = Gs.run(grid_latents, grid_labels, is_validation=True, minibatch_size=sched.minibatch//submit_config.num_gpus)
    misc.save_image_grid(grid_fakes, os.path.join(submit_config.run_dir, 'fakes%06d.png' % (cur_nimg // 1000)), drange=drange_net, grid_size=grid_size)
if cur_tick % network_snapshot_ticks == 0 or done or cur_tick == 1:
    pkl = os.path.join(submit_config.run_dir, 'network-snapshot-%06d.pkl' % (cur_nimg // 1000))
    misc.save_pkl((G, D, Gs), pkl)
    metrics.run(pkl, run_dir=submit_config.run_dir, num_gpus=submit_config.num_gpus, tf_config=tf_config)
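Given the size of each pickle, it helps to work out in advance which ticks will write files. The sketch below replays the two conditions from the snippet. Here done is simplified to “this is the last tick”, and the tick counts are made-up example values. The 300 MB figure is the approximate pickle size quoted earlier.

```python
def snapshot_ticks(total_ticks, image_snapshot_ticks, network_snapshot_ticks):
    """Return the ticks at which image grids and network pickles are written."""
    image_ticks, network_ticks = [], []
    for cur_tick in range(1, total_ticks + 1):
        done = cur_tick == total_ticks  # simplification of the real stopping test
        if cur_tick % image_snapshot_ticks == 0 or done:
            image_ticks.append(cur_tick)
        if cur_tick % network_snapshot_ticks == 0 or done or cur_tick == 1:
            network_ticks.append(cur_tick)
    return image_ticks, network_ticks

image_ticks, network_ticks = snapshot_ticks(
    total_ticks=25, image_snapshot_ticks=1, network_snapshot_ticks=10)
approx_disk_mb = len(network_ticks) * 300  # roughly 300 MB per pickle
print(network_ticks, approx_disk_mb)
```

Note that a pickle is always written at the first tick and at the end of training, regardless of network_snapshot_ticks.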

As mentioned previously, we’ll be focusing on the results after a single tick (6 hours). Let’s take a look at the generated image snapshot grid:

Beautiful.

Our results definitely possess an artistic flair, with styles ranging from classical to impressionist on display. With more training time, even more varied examples could be produced. As usual, the code can be found on GitHub.

Personally, I like to think that we’ve accessed the Presidents of alternative dimensions! One thing’s for sure: our generated Presidents certainly have style.

Forgive me, for the hour is late

This won’t be the end of our StyleGAN journey: we’ll be applying StyleGANs to social networks, so stay tuned!

Sources

Gwern, Making Anime Faces with StyleGAN

Karras et al., A Style-Based Generator Architecture for Generative Adversarial Networks

ak9250, Stylegan-art repository

Horev, Explained: A Style-Based Generator Architecture for GANs — Generating and Tuning Realistic Artificial Faces