Into the Cageverse — Deepfaking with Autoencoders: An Implementation in Keras and TensorFlow

Adrian Yijie Xu, PhD
Published in GradientCrescent · Jul 22, 2019

This is the second part of our special feature series on deepfakes, exploring the latest developments and implications in this nascent field of AI. We will cover improved generation and countermeasure strategies in future articles; please stay tuned to GradientCrescent to learn more.

Introduction

Deepfakes have rapidly become a mainstream topic of discussion, surrounded by a mixture of hype, anticipation, and fear. The recent release of polished examples such as The Shining deepfake video, which places Jim Carrey’s likeness over Jack Nicholson’s performance, has drawn attention to the intricacies behind deepfake generation.

Heeere’s Jimmy!

But how are deepfakes made? Are they resource intensive? Do they require extensive data and computing power? Let’s explore the topic by generating our own deepfake. We’ll create a simple facial-transfer deepfake of two Marvel superheroes, Spiderman (played by Tom Holland) and Thor (played by Chris Hemsworth), replacing their facial likenesses with that of actor Nicolas Cage. This is not as far-fetched as it sounds: he was considered for the role of Superman in the planned Tim Burton reboot in the 1990s.

This man deserves a chance. Let’s help Mr. Cage audition for Phase 4 of the Marvel Universe.

Before we begin, we recommend that the reader consult our previous articles on deepfakes and autoencoders for a deeper understanding of the theory behind deepfake generation. The essence of the process is best illustrated with the following diagram:

Demonstration of foreign decoder use in DeepFake generation (Oberov et al.)

Given a reference and a target set of facial images, we can train a pair of autoencoders: one for target-to-target reconstruction and one for reference-to-reference reconstruction. Each autoencoder consists of an encoder and a decoder, where the encoder’s role is to learn to compress an input image into a lower-dimensional representation, and the decoder’s role is to reconstruct that representation back into the original image. This encoded representation captures the facial characteristics of the original image, such as expression.

To generate a deepfake, the encoded representation of the target is fed into the decoder of the reference, resulting in the generation of an output featuring the reference’s likeness but with the characteristics of the target.
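Conceptually, the swap boils down to routing one actor’s encoding through the other actor’s decoder. As a minimal sketch (the shared_encoder, decoder_reference, and target_batch names here are purely illustrative and not objects from our notebooks):

# Encode the target actor's faces, then decode them with the reference actor's decoder
encoded_faces = shared_encoder.predict(target_batch)       # e.g. a batch of 64 x 64 x 3 target frames
deepfake_faces = decoder_reference.predict(encoded_faces)  # reference likeness, target expression/pose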

(Note that while deepfakes are commonly associated with video data, they are generally trained and generated frame-by-frame before the results are spliced together into a working clip. Hence, static deepfakes are sufficient for demonstration purposes)

Naturally, feel free to flesh out the rest of the Avengers into screen tests for Mr. Cage

We assume that the reader is familiar with deep learning, specifically convolutional neural networks. All of our code can be found on our GitHub repository and was written in Python using the Keras and TensorFlow libraries.

Implementation

Our implementation relies on a simple autoencoder without a GAN component, and is based on a simplified version of the FaceSwap-GAN repository by Lu et al., which we’ve optimized for cloud-based environments. While adding a GAN discriminator to distinguish between real and generated outputs would undoubtedly improve performance, a plain autoencoder approach is sufficient to demonstrate the elements of deepfake generation.

We’ve split our code into three notebooks, covering data preprocessing, model training, and deepfake generation. Before we start, note that every notebook begins by importing an auxiliary lib_1 library, which contains several helper files built around image-processing libraries such as OpenCV (cv2) and face_recognition. Briefly, these helpers handle image-processing tasks such as face detection, extraction, and manipulation.

Each of these topics could warrant an article in its own right, so a detailed treatment is omitted here. However, we will briefly summarize the role of the critical components as we encounter them in each notebook.

Let’s cover each notebook in sequence.

Preprocessing

In order to train our autoencoders effectively, all of our data must cover roughly the same facial area. Hence, we must first extract faces from each directory of images using a reference filter of the actor of interest. This is the primary purpose of Notebook 1.

Our dataset was generated using a batch download script to pull 500 images of each actor into three compressed files, which serve as our initial input.

After importing all of our auxiliary libraries and raw images into the instance, we begin by defining the input and output directories.

input_directory = "../content/chris/"
# TODO: change this argument to the input data directory; should be chris, tom, or nicolas
output_directory = "../content/extracted/"

Next, we define the methods that load our reference filters and extract faces using them.

def load_filter():
    filter_file = '../content/filter/chrisfilter.jpg'  # TODO: change this depending on which actor you're extracting
    if os.path.exists(filter_file):
        print('Loading reference image for filtering')
        return FaceFilter(filter_file)
    else:
        print('Filter not detected')

def get_faces(image):
    faces_count = 0
    filterDeepFake = load_filter()

    for face in detect_faces(image):
        if filterDeepFake is not None and not filterDeepFake.check(face):
            print('Skipping not recognized face!')
            continue
        yield faces_count, face
        faces_count += 1  # index each accepted face within the image

Per the FaceFilter class in the auxiliary library, each filter instance holds a lower-dimensional encoding of the provided reference image of the actor of interest. We then systematically compare this reference encoding against the faces detected in our dataset, using a set threshold to balance false positives against false negatives.
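For illustration, here is roughly what such a filter might look like if written directly on top of the face_recognition library (the actual FaceFilter lives in the auxiliary library and may differ; the class name and the 0.6 threshold below are assumptions):

import face_recognition

class SimpleFaceFilter:
    """Compares candidate face crops against a single reference encoding."""

    def __init__(self, reference_path, threshold=0.6):  # 0.6 is face_recognition's default tolerance
        reference_image = face_recognition.load_image_file(reference_path)
        # 128-dimensional encoding of the reference actor's face
        self.reference_encoding = face_recognition.face_encodings(reference_image)[0]
        self.threshold = threshold

    def check(self, face_image):
        encodings = face_recognition.face_encodings(face_image)
        if not encodings:
            return False  # no detectable face in this crop
        distance = face_recognition.face_distance([self.reference_encoding], encodings[0])[0]
        return distance <= self.threshold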

Once a face is detected within an image, we mark it for extraction into our output folder. Our extraction method is a simple cropping function that involves no tracking of facial landmarks. Using facial landmarks would give more consistent results, but a simple crop was judged sufficient for this demonstration. For a more detailed look at the processes involved in facial detection, we highly recommend consulting the Face_Recognition repository.

try:
    for filename in folder_img:
        image = cv2.imread(filename)

        for idx, face in get_faces(image):
            resized_image = extractor.extract(image, face, 256)
            output_file = output_directory + "/" + str(Path(filename).stem)
            cv2.imwrite(str(output_file) + str(idx) + Path(filename).suffix, resized_image)
except Exception as e:
    print('Failed to extract from image: {}. Reason: {}'.format(filename, e))

This process was repeated for all three actors, with the outputs extracted into compressed files for the next step, model training.

Training

This notebook covers the training of our autoencoder models on the extracted facial data of our actors. A pair of autoencoders is trained together, with one always trained on the Nicolas Cage dataset. Note that, for simplicity, the two models share a single encoder.

To speed up training given the limited resources on Colaboratory, we apply transfer learning by downloading the weights of the encoder and the two decoders from the FaceSwap repository. This accelerates convergence and lets us reach good results with only light fine-tuning at small learning rates. Once that’s done, we define a few parameters for our network, covering the location of the data, our pre-trained weights, and the dimension of the final densely connected layer of the encoder.

sav_Model = "../content/saved_model/"
pretrained_weight = "../content/weight"
image_actor_A_directory = "../content/chris"  # ORIGINAL
image_actor_B_directory = "../content/nic"    # TARGET TO REPLACE WITH
batch_size = 1
save_interval = 100
ENCODER_DIM = 1024

image_extensions = [".jpg", ".jpeg", ".png"]
encoderH5 = '/encoder.h5'
decoder_AH5 = '/decoder_A.h5'
decoder_BH5 = '/decoder_B.h5'
IMAGE_SHAPE = (64, 64, 3)

With our variables defined, let’s build our autoencoder model by defining a model class.

class dfModel():
    def __init__(self):
        self.model_dir = sav_Model
        self.pretrained_weight = pretrained_weight
        self.encoder = self.Encoder()
        self.decoder_A = self.Decoder()
        self.decoder_B = self.Decoder()
        self.initModel()

    def initModel(self):
        optimizer = Adam(lr=5e-5, beta_1=0.5, beta_2=0.999)  # original Adam learning rate: 5e-5
        x = Input(shape=IMAGE_SHAPE)
        self.autoencoder_A = KerasModel(x, self.decoder_A(self.encoder(x)))
        self.autoencoder_B = KerasModel(x, self.decoder_B(self.encoder(x)))
        print(self.encoder.summary())
        print(self.decoder_A.summary())
        self.autoencoder_A.compile(optimizer=optimizer, loss='mean_absolute_error')
        self.autoencoder_B.compile(optimizer=optimizer, loss='mean_absolute_error')

    def converter(self, swap):
        autoencoder = self.autoencoder_B if not swap else self.autoencoder_A
        return lambda img: autoencoder.predict(img)

    def conv(self, filters):
        def block(x):
            x = Conv2D(filters, kernel_size=5, strides=2, padding='same')(x)
            x = LeakyReLU(0.1)(x)
            return x
        return block

    def upscale(self, filters):
        def block(x):
            x = Conv2D(filters * 4, kernel_size=3, padding='same')(x)
            x = LeakyReLU(0.1)(x)
            # PixelShuffler's job here is analogous to UpSampling2D
            x = PixelShuffler()(x)
            return x
        return block

Besides the initialization methods, we’ve grouped convolutional layers with their auxiliary layers into blocks inside the conv and upscale methods, which reduces clutter in our code. The PixelShuffler() layer is another helper from the auxiliary library; it stands in for the upsampling layer commonly seen in Keras, which repeats rows and columns by a given factor to increase the spatial dimensions of the output.
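If you’re curious what PixelShuffler actually does, a rough functional equivalent can be written with a Keras Lambda layer around TensorFlow’s depth_to_space operation (a sketch for intuition only, not the auxiliary library’s implementation):

import tensorflow as tf
from keras.layers import Lambda

def pixel_shuffle(block_size=2):
    # Rearranges channel data into spatial dimensions:
    # (batch, H, W, C * block_size**2) -> (batch, H * block_size, W * block_size, C)
    return Lambda(lambda t: tf.nn.depth_to_space(t, block_size))

This is why each upscale block requests filters * 4 channels from its Conv2D layer: the shuffle trades those extra channels for a doubling of height and width.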

As we are building a pure autoencoder model, you’ll notice that we use mean absolute error as the loss function, which measures the pixel-by-pixel differences between the original inputs and the generated outputs.

Next, the key components of our autoencoder are introduced through the Encoder() and Decoder() methods.

def Encoder(self):
    input_ = Input(shape=IMAGE_SHAPE)
    x = input_
    x = self.conv(128)(x)
    x = self.conv(256)(x)
    x = self.conv(512)(x)
    x = self.conv(1024)(x)
    x = Dense(ENCODER_DIM)(Flatten()(x))
    x = Dense(4 * 4 * 1024)(x)
    # Flattened activations pass through two dense layers: ENCODER_DIM and 4*4*1024
    x = Reshape((4, 4, 1024))(x)
    # Reshape back into a (4, 4, 1024) volume
    x = self.upscale(512)(x)
    return KerasModel(input_, x)

def Decoder(self):
    input_ = Input(shape=(8, 8, 512))
    x = input_
    x = self.upscale(256)(x)  # actually 1024 filters given filters*4
    x = self.upscale(128)(x)  # actually 512
    x = self.upscale(64)(x)   # actually 256
    x = Conv2D(3, kernel_size=5, padding='same', activation='sigmoid')(x)
    return KerasModel(input_, x)

Let’s take a look at the model summary to keep track of the dimensions of our data as it flows through the network.

Briefly, the encoder consists of successive pairs of convolutional and LeakyReLU activation layers with increasing numbers of 5 x 5 filters (doubling at each layer), while a stride of 2 halves the size of the activation map after each layer.

The resulting activation map is then flattened into a vector and fed into a densely connected layer, which holds the latent information, or encoding, of the image. Uniquely, the deepfake architecture then converts this encoding back into a small 2D spatial representation of the subject for the decoder.

The decoder’s role is simpler: it decodes and upscales this intermediate representation back into an acceptable 64 x 64 output through further convolutional layers. With training, the autoencoder therefore learns both to compress an input into this intermediate representation and to restore it into a realistic mimicry of the original input.
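For reference, here is how the tensor shapes evolve for a 64 x 64 x 3 input, based on the layer definitions above:

# Encoder
# conv(128):    64 x 64 x 3    -> 32 x 32 x 128
# conv(256):    32 x 32 x 128  -> 16 x 16 x 256
# conv(512):    16 x 16 x 256  -> 8 x 8 x 512
# conv(1024):   8 x 8 x 512    -> 4 x 4 x 1024
# Flatten -> Dense(1024) -> Dense(4*4*1024) -> Reshape -> 4 x 4 x 1024
# upscale(512): 4 x 4 x 1024   -> 8 x 8 x 512   (encoder output / decoder input)
#
# Decoder
# upscale(256): 8 x 8 x 512    -> 16 x 16 x 256
# upscale(128): 16 x 16 x 256  -> 32 x 32 x 128
# upscale(64):  32 x 32 x 128  -> 64 x 64 x 64
# Conv2D(3):    64 x 64 x 64   -> 64 x 64 x 3   (reconstructed image)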

The rest of the class defines the loading of pre-trained weights, the saving of intermediate weights, and the training process.

def load(self, swapped):
    (face_A, face_B) = (decoder_AH5, decoder_BH5) if not swapped else (decoder_BH5, decoder_AH5)
    try:
        self.encoder.load_weights(self.pretrained_weight + encoderH5)
        self.decoder_A.load_weights(self.pretrained_weight + face_A)
        self.decoder_B.load_weights(self.pretrained_weight + face_B)
        print('loaded model weights')
        return True
    except Exception as e:
        print('Failed loading existing training data.')
        print(e)
        return False

def save_weights(self):
    self.encoder.save_weights(self.model_dir + encoderH5)
    self.decoder_A.save_weights(self.model_dir + decoder_AH5)
    self.decoder_B.save_weights(self.model_dir + decoder_BH5)
    print('saved model weights')

We then run the training process for both pairs of actors, saving the model weights periodically for inspection and evaluation. These weights are then loaded into the final notebook, which covers output generation.
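In practice, each training iteration simply calls train_on_batch on both autoencoders with batches drawn from the respective actor’s extracted faces. A simplified sketch of the loop (get_training_data here stands in for the notebook’s batch-generation helper, which also applies warping and augmentation; the name is illustrative):

model = dfModel()
model.load(False)  # start from the downloaded pre-trained weights

for epoch in range(90):
    # Each autoencoder learns to reconstruct its own actor's faces
    warped_A, target_A = get_training_data(images_A, batch_size)
    warped_B, target_B = get_training_data(images_B, batch_size)

    loss_A = model.autoencoder_A.train_on_batch(warped_A, target_A)
    loss_B = model.autoencoder_B.train_on_batch(warped_B, target_B)
    print('Epoch {}: loss_A={:.5f}, loss_B={:.5f}'.format(epoch, loss_A, loss_B))

    if epoch % save_interval == 0:
        model.save_weights()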

Generation

Our final notebook utilizes the trained weights of our autoencoder to generate a deepfake via facial transfer. Recall that we can connect the encoded representation of actor A with the decoder of actor B, resulting in the generation of an image of actor B but with the facial characteristics of A.

To achieve this, we’ll use the converter method of our autoencoder model, which runs a simple prediction through the neural network to generate a restored image from a sampled test image.

model = Model()
if not model.load(swap_model):
    print('Model Not Found! A valid model must be provided to continue!')
    exit(1)

faceswap_converter = PluginLoader.get_converter(conv_name)(
    model.converter(False),
    blur_size=blur_size,
    seamless_clone=seamless_clone,
    mask_type=mask_type,
    erosion_kernel_size=erosion_kernel_size,
    smooth_mask=smooth_mask,
    avg_color_adjust=avg_color_adjust
)

list_faces = get_list_images_faces()
batch = BackgroundGenerator(list_faces, 1)

for item in batch.iterator():
    convert(faceswap_converter, item)

With our generative model defined, we trained our network for a total of 90 epochs, saving weights at regular intervals for both actors. The results, together with the original target image for both actors, are shown below:

Generated outputs of Spiderman together with original input (left), after 30, 60, and 90 epochs
Close-up of Spider-Cage after 60 epochs.
Generated outputs of Thor together with original input (left), after 40, 70, and 90 epochs
Close-up of Thor-Cage after 70 epochs.

From our results, we can conclude that while we’re on the right track, there is still room for improvement in output realism. The network converges at roughly 60 epochs; extended training leads to increasing overfitting to the training dataset, emphasizing subtler features such as distinct skin tones that are detrimental to the authenticity of the image.

Let’s analyze the weaknesses of our model and think of some possible solutions:

  • The most obvious weakness of our model is the lack of a discriminative component: the generated outputs from both the inter- and intra-domain autoencoder setups are never evaluated for realism by a separately trained network, the role commonly played by the discriminator in a GAN. Since a discriminator is trained across the entire dataset, it could reward or penalize the generator depending on whether its output is judged to belong to the original distribution.
  • The generated outputs are blurry. This is partly due to the difference in resolution between the target image and the generated output (64 x 64), but also due to the use of a simple pixel-wise MAE as the loss function. Studies in the literature have shown that more complex, composite loss functions yield better results (see the sketch after this list).
  • Our model does not use facial landmarks to identify facial attributes during extraction and transfer. This could be responsible for the "glued-on" effect that is more clearly visible at higher epoch counts.
  • Naturally, the more data we have, the better results we would get in terms of mimicking different characteristics.
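As an example of the composite-loss idea above, a pixel-wise MAE term could be blended with a structural-dissimilarity (DSSIM) term in a custom Keras loss. This is only an illustration of the concept; the 50/50 weighting below is arbitrary, not a tuned value:

import tensorflow as tf
import keras.backend as K

def composite_loss(y_true, y_pred, alpha=0.5):
    # Pixel-wise mean absolute error, as used in the article
    mae = K.mean(K.abs(y_true - y_pred))
    # Structural dissimilarity: penalizes differences in local structure and contrast
    dssim = K.mean((1.0 - tf.image.ssim(y_true, y_pred, max_val=1.0)) / 2.0)
    return alpha * mae + (1.0 - alpha) * dssim

# e.g. autoencoder_A.compile(optimizer=optimizer, loss=composite_loss)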

Overall, we’ve illustrated how deepfake generation works. In particular, we’ve shown how feasible the process is even with a limited dataset and free cloud-based resources. In our next article, we’ll examine whether it’s possible to distinguish synthetic images from real ones by varying the complexity of the architecture.

If you enjoyed this article, please consider subscribing to GradientCrescent to stay updated on our latest publications.
