Seeing is Believing — Mesoscopic Neural Networks for Synthetic Image Detection: an Implementation in Keras and TensorFlow

Adrian Yijie Xu, PhD
GradientCrescent
Aug 23, 2019

This is the last part of our special feature series on Deepfakes, exploring the latest developments and implications in this nascent field of AI. This time, we cover an innovative approach to deepfake and synthetic content detection.

Introduction

Over the past few tutorials, we’ve introduced and discussed the theories and concepts behind deepfake generation, implementing a simple example as a proof of concept. We’ve also briefly discussed the techniques used for deepfake and manipulation detection. The possible negative implications arising from the democratization of high-quality deepfakes generated on consumer-grade hardware have led to increasing demand for effective detection methods.

Let’s put a smile to that face. (Ctrl Shift Face)

We’ve covered some of these in our previous articles, with some of the most effective approaches relying on sequenced data to identify changes in a subject over time. While accurate, such approaches are highly resource- and time-intensive, and do little to address the proliferation of individual forged image examples.

In contrast, mesoscopic analysis has arisen as an instance-based approach to deepfake detection that may also be useful in identifying other synthetic images. Introduced in a study by Afchar et al., mesoscopic analysis relies on the observation that distinguishing between real and forged images is difficult at both the microscopic and the semantic level. In the former case, compression-related noise may drown out any meaningful differences, while in the latter, the overall semantic content of the two images may share more similarities than differences, as is often observed with manipulated forgeries.

To combat this, the authors designed a network capable of capturing smaller-scale semantic (or mesoscopic) details using a relatively shallow neural network architecture, which also possesses the additional advantage of reduced training time. Using this relatively shallow intermediate architecture, an accuracy of over 98% could be achieved on deepfaked examples, versus an accuracy of 93–96% (depending on compression) for the more traditional Xception architecture.

To gain an intuition for how mesoscopic neural networks better capture the small details of an image, let’s recall how a convolution works. We’ve previously covered this in one of our earliest tutorials, but to refresh our minds, let’s consider the simple case of an edge detector as presented by Andrew Ng.

Simple edge detector, as presented by Prof. Ng in his Coursera Modules.

From the diagram, we observe that the input image is subjected to a filter (or kernel) to check for the presence of certain patterns, in this case a straight line. If a particular pattern is detected, its presence is registered in a smaller output activation map. The next layer then utilizes its own filters to detect more complex patterns of edges. Given enough layers, these maps become complex enough to capture the more comprehensive characteristics of an image, such as the outline of the target object. While a deep neural network may be great for traditional object recognition tasks, it can be less effective in distinguishing faked composite images from real ones, where the differences may be minute enough that a traditional network cannot detect them at the higher semantic level.
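
To make this concrete, here is a minimal sketch (not from the original notebook) of a vertical-edge kernel being slid over a toy grayscale image in NumPy. The image values and kernel mirror the classic example above, and the explicit loop is simply the plain definition of the operation rather than an optimized implementation.

import numpy as np

# Toy 6x6 grayscale "image": a bright left half meeting a dark right half.
image = np.array([[10, 10, 10, 0, 0, 0]] * 6, dtype=float)

# Classic 3x3 vertical-edge kernel from the example above.
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)

def convolve2d_valid(img, k):
    # Slide the kernel over the image with no padding and stride 1
    # (strictly a cross-correlation, as in most deep learning frameworks).
    kh, kw = k.shape
    out = np.zeros((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * k)
    return out

print(convolve2d_valid(image, kernel))
# Large values appear only in the middle columns, i.e. where the vertical edge sits.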

Unlike traditional deep networks, mesoscopic neural networks deliberately combine a shallow architecture with small filters in order to shift focus away from higher-level pattern abstraction, instead aiming to learn simpler, localized patterns that better capture fine detail. As such, they are argued to be better able to detect the minute differences between current faked image data and real images.

We’ve seen how these models perform on deepfaked datasets, but what about truly unique composite images? This task may be significantly harder, as the dataset consists of images (either real or faked) where each image is of a single unique individual, rather than a series of images of a particular subject.

To answer this question, we’ll be evaluating three architectures on the Flickr-Faces-HQ (FFHQ) dataset and a composite StyleGAN dataset produced from the former.

Implementation

We utilized the publicly available high-resolution datasets from the NVIDIA StyleGAN repository, namely the FFHQ dataset for true examples together with a collection of synthetically generated counterparts produced with the StyleGAN architecture. The workings of StyleGAN-based image generation have been covered in previous articles, and hence will not be discussed here. It must be mentioned that the synthetic dataset was created with a truncation parameter value of 0.5, judged to give realistic results. Examples generated with a higher truncation value tend to appear more distorted, and would in theory differ more significantly from the real-world examples in the FFHQ dataset.

Four image examples from our StyleGAN dataset. You’ll notice that some background warping and discoloration can be observed upon closer inspection — some of the weaknesses of generative models discussed in previous articles.

Finally, to save space and reduce training time, we’ll be using a reduced dataset size of only 500 images per class. Naturally the use of more intra-domain data would likely increase the accuracy of our models, but the dataset size was judged sufficient as a proof of concept.
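
For reference, here is one way such a reduced dataset could be laid out on disk for Keras’s flow_from_directory. This is a sketch rather than the exact cell from our notebook, and the source and destination folder names are placeholders.

import os
import random
import shutil

def sample_class(src_dir, dst_dir, n=500, seed=42):
    # Copy a random subset of n images from src_dir into dst_dir.
    os.makedirs(dst_dir, exist_ok=True)
    files = [f for f in os.listdir(src_dir)
             if f.lower().endswith(('.png', '.jpg', '.jpeg'))]
    random.Random(seed).shuffle(files)
    for f in files[:n]:
        shutil.copy(os.path.join(src_dir, f), os.path.join(dst_dir, f))

# Placeholder folder names; flow_from_directory treats each subfolder of the
# dataset path as one class.
sample_class('ffhq_full', '/content/data/real', n=500)
sample_class('stylegan_full', '/content/data/fake', n=500)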

As usual, our code was written in Keras and TensorFlow, and run in a Google Colaboratory GPU-enabled notebook instance. All code is available on the GradientCrescent repository.

To begin with, let’s import our datasets:

#Download the training dataset- 10psi
#!gdown https://drive.google.com/uc?id=1aJvpWIQ6G8fHwRCWRgrtKaKUQ8zaRLsK
#Download the training dataset- 5psi
!gdown https://drive.google.com/uc?id=1Zxke5rZZOPRNAf0AmETNHUfbn6EPyeYi

#Download FFHQ data
!gdown https://drive.google.com/uc?id=1ky465Gxe-7KYF5cheXgxHIkevJu7Ofbc

#Unzip all folders
!unzip real.zip -d data
!unzip stylegan_05psi.zip -d data

Next, let’s set up Keras’s ImageDataGenerator to apply a standard set of preprocessing functions. This should look familiar to readers who have used the Keras package before; if not, we highly recommend consulting our earlier tutorials, which go into more detail on the subject.

from tensorflow.keras import backend as K
from tensorflow.keras.models import Model ,load_model
from tensorflow.keras.layers import Flatten, Dense, Dropout
from tensorflow.keras.applications.inception_resnet_v2 import InceptionResNetV2, preprocess_input
from tensorflow.keras.optimizers import Adam, RMSprop
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.callbacks import ModelCheckpoint
import numpy as np
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
from tensorflow.keras.applications.inception_v3 import InceptionV3
import tensorflow as tf
DATASET_PATH = '/content/data'
IMAGE_SIZE = (299, 299)
NUM_CLASSES = 2
BATCH_SIZE = 32 # try reducing batch size or freeze more layers if your GPU runs out of memory
FREEZE_LAYERS = 16 # freeze the first this many layers for training
NUM_EPOCHS = 30
LEARNING_RATE = 0.002 # slow learning rate, as we are transfer training
DROP_OUT = .5

train_datagen = ImageDataGenerator(preprocessing_function=preprocess_input,
rotation_range=50,
featurewise_center = True,
featurewise_std_normalization = True,
width_shift_range=0.2,
height_shift_range=0.2,
shear_range=0.25,
zoom_range=0.1,
zca_whitening = True,
channel_shift_range = 20,
horizontal_flip = True ,
vertical_flip = True ,
validation_split = 0.2,
fill_mode='constant')
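# Note: featurewise_center, featurewise_std_normalization and zca_whitening only
# take effect after calling train_datagen.fit() on a representative array of images;
# used with flow_from_directory alone, Keras will warn and skip these steps.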
train_batches = train_datagen.flow_from_directory(DATASET_PATH,
target_size=IMAGE_SIZE,
shuffle=True,
batch_size=BATCH_SIZE,
subset = "training",
class_mode='binary')
valid_batches = train_datagen.flow_from_directory(DATASET_PATH,
target_size=IMAGE_SIZE,
shuffle=True,
batch_size=BATCH_SIZE,
subset = "validation",
class_mode='binary')

To establish a benchmark target, let’s build and set up an InceptionV3 model pre-trained on the ImageNet dataset, freeze the first 90 layers, and fine-tune the model using a learning rate of 2E-4 for 15 epochs. Note that in the final notebook, this section has been commented out. Make sure to enable it (and disable the mesoscopic architecture in turn) in order to replicate our results.

from tensorflow.keras.applications.inception_v3 import InceptionV3
from tensorflow.keras.preprocessing import image
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D,BatchNormalization,Conv2D,MaxPooling2D,Dropout,Flatten,LeakyReLU
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Input
net = InceptionV3(include_top=False,
weights='imagenet',
input_tensor=None,
input_shape=(299,299,3))

x = net.output
x = GlobalAveragePooling2D()(x)
x = Dense(1024, activation='relu')(x)
# add dropout
x = Dropout(0.5)(x)
x = Dense(256, activation='relu')(x)
x = Dropout(0.5)(x)
x = Dense(32, activation='relu')(x)

predictions = Dense(1, activation='sigmoid')(x)
model = Model(net.input, predictions)
for layer in net.layers[:90]:
    layer.trainable = False
model.compile(optimizer=Adam(lr=LEARNING_RATE),
loss='binary_crossentropy', metrics=['accuracy'])
model.summary()

result = model.fit_generator(train_batches,
steps_per_epoch = 100,
validation_data = valid_batches,
validation_steps = 50,
epochs = NUM_EPOCHS,
)
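
As an aside, the ModelCheckpoint callback imported earlier isn’t wired into the call above; one way it could be used during these long runs is sketched below. The output filename is our own placeholder, and the monitored key matches the 'val_acc' entries plotted later.

# Optional: keep the best weights seen so far on disk while training.
checkpoint = ModelCheckpoint('inception_benchmark.h5', # placeholder output path
                             monitor='val_acc',
                             save_best_only=True,
                             verbose=1)
# Passing callbacks=[checkpoint] to the fit_generator call above would enable it.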

Training should take a couple of hours. Afterwards, we can plot our results using the Matplotlib package.

import matplotlib.pyplot as plt

def plot_acc_loss(result, epochs):
    acc = result.history['acc']
    loss = result.history['loss']
    val_acc = result.history['val_acc']
    val_loss = result.history['val_loss']
    plt.figure(figsize=(15, 5))
    plt.subplot(121)
    plt.plot(range(epochs), acc, label='Train_acc')
    plt.plot(range(epochs), val_acc, label='Test_acc')
    plt.title('Accuracy over ' + str(epochs) + ' Epochs', size=15)
    plt.legend()
    plt.grid(True)
    plt.subplot(122)
    plt.plot(range(epochs), loss, label='Train_loss')
    plt.plot(range(epochs), val_loss, label='Test_loss')
    plt.title('Loss over ' + str(epochs) + ' Epochs', size=15)
    plt.legend()
    plt.grid(True)
    plt.show()

plot_acc_loss(result, 30)

Performance of the InceptionV3 architecture (benchmark)

Our validation accuracy approaches 80%. That’s not terrible, but it’s not great either. You’ll notice high variation in the accuracy values, a common theme across our modelling attempts here. This is due to a lack of convergence, and is related to the wide range of image-specific factors, such as pose, facial details, and background noise. The InceptionV3 model works well as a benchmark as it’s a pre-trained, highly complex architecture with over 300 layers.

With our benchmark results done, let’s train our mesoscopic models. We’ll start with the Meso4 model.

x = Input(shape = (299, 299, 3))

x1 = Conv2D(8, (3, 3), padding='same', activation = 'relu')(x)
x1 = BatchNormalization()(x1)
x1 = MaxPooling2D(pool_size=(2, 2), padding='same')(x1)

x2 = Conv2D(8, (5, 5), padding='same', activation = 'relu')(x1)
x2 = BatchNormalization()(x2)
x2 = MaxPooling2D(pool_size=(2, 2), padding='same')(x2)

x3 = Conv2D(16, (5, 5), padding='same', activation = 'relu')(x2)
x3 = BatchNormalization()(x3)
x3 = MaxPooling2D(pool_size=(2, 2), padding='same')(x3)

x4 = Conv2D(16, (5, 5), padding='same', activation = 'relu')(x3)
x4 = BatchNormalization()(x4)
x4 = MaxPooling2D(pool_size=(4, 4), padding='same')(x4)

y = Flatten()(x4)
y = Dropout(0.5)(y)
y = Dense(16)(y)
y = LeakyReLU(alpha=0.1)(y)
y = Dropout(0.5)(y)
y = Dense(1, activation = 'sigmoid')(y)

model = Model(inputs = x, outputs = y)
model.compile(optimizer=Adam(lr=LEARNING_RATE),
loss='binary_crossentropy', metrics=['accuracy'])

As you can observe, the architecture is incredibly simple, involving just four sets of convolutions. Note that as this model hasn’t been pre-trained, it’s reasonable to train it over more iterations to compensate; this was achieved by training at a learning rate of 0.002 for 30 epochs, followed by another 20 epochs at a lower learning rate of 2E-4. This schedule helps to balance convergence with overall training time.
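
The notebook handles this in two stages. As a rough sketch, assuming the compiled Meso4 model and the generators defined earlier, and reusing the step counts from the benchmark run, the schedule might look like this:

# Stage 1: initial training at the higher learning rate (LEARNING_RATE = 0.002).
result = model.fit_generator(train_batches,
                             steps_per_epoch = 100,
                             validation_data = valid_batches,
                             validation_steps = 50,
                             epochs = 30)

# Stage 2: recompile with a lower learning rate and continue for another 20 epochs.
model.compile(optimizer=Adam(lr=2e-4),
              loss='binary_crossentropy', metrics=['accuracy'])
result_finetune = model.fit_generator(train_batches,
                                      steps_per_epoch = 100,
                                      validation_data = valid_batches,
                                      validation_steps = 50,
                                      epochs = 20)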

Performance of the Meso4 model after training (top) and fine-tuning (bottom).

With a validation accuracy approaching 70%, that’s somewhat worse than the Inception architecture, but a good starting point. The difference in performance can be attributed to a multitude of factors, but the Inception network’s use of so-called “Inception” blocks stands out. Essentially, these self-contained blocks contain convolutional layers with small filters arranged in parallel, with the results being concatenated at the end of the block. This approach allows the model to learn from multiple small filters at a given scale in detail, while also learning across multiple scales in parallel.

Taking inspiration from the Inception architecture, the authors then designed the Meso4-Inception model, featuring blocks with parallel convolutional layers defined with the InceptionLayer method:

from tensorflow.keras.layers import Concatenate

def InceptionLayer(a, b, c, d):
    def func(x):
        x1 = Conv2D(a, (1, 1), padding='same', activation='relu')(x)

        x2 = Conv2D(b, (1, 1), padding='same', activation='relu')(x)
        x2 = Conv2D(b, (3, 3), padding='same', activation='relu')(x2)

        x3 = Conv2D(c, (1, 1), padding='same', activation='relu')(x)
        x3 = Conv2D(c, (3, 3), dilation_rate = 2, strides = 1, padding='same', activation='relu')(x3)

        x4 = Conv2D(d, (1, 1), padding='same', activation='relu')(x)
        x4 = Conv2D(d, (3, 3), dilation_rate = 3, strides = 1, padding='same', activation='relu')(x4)

        # Concatenate the four parallel branches along the channel axis
        y = Concatenate(axis = -1)([x1, x2, x3, x4])

        return y
    return func


x = Input(shape = (299, 299, 3))

x1 = InceptionLayer(1, 4, 4, 2)(x)
x1 = BatchNormalization()(x1)
x1 = MaxPooling2D(pool_size=(2, 2), padding='same')(x1)

x2 = InceptionLayer(2, 4, 4, 2)(x1)
x2 = BatchNormalization()(x2)
x2 = MaxPooling2D(pool_size=(2, 2), padding='same')(x2)

x3 = Conv2D(16, (5, 5), padding='same', activation = 'relu')(x2)
x3 = BatchNormalization()(x3)
x3 = MaxPooling2D(pool_size=(2, 2), padding='same')(x3)

x4 = Conv2D(16, (5, 5), padding='same', activation = 'relu')(x3)
x4 = BatchNormalization()(x4)
x4 = MaxPooling2D(pool_size=(4, 4), padding='same')(x4)

y = Flatten()(x4)
y = Dropout(0.5)(y)
y = Dense(16)(y)
y = LeakyReLU(alpha=0.1)(y)
y = Dropout(0.5)(y)
y = Dense(1, activation = 'sigmoid')(y)
model = Model(inputs = x, outputs = y)
model.compile(optimizer=Adam(lr=LEARNING_RATE),
loss='binary_crossentropy', metrics=['accuracy'])

This model was then trained under the same conditions as before: at a learning rate of 0.002 for 30 epochs, followed by another 20 epochs at a lower learning rate of 2E-4.

Performance of the Meso4-Inception model after training (top) and fine-tuning (bottom).

The result is a significant improvement over the original Meso4 model, with a validation accuracy exceeding 90%. This represents a roughly 20-point improvement over the base Meso4 model and a roughly 10-point increase over the Inception architecture’s attempt, and it performs well enough to serve as a simple classifier for distinguishing today’s synthetic forgeries.
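
As a closing usage note, here is a hedged sketch of how the trained classifier could be applied to a single image. The file name is a placeholder, and the mapping of the sigmoid output to “real” versus “synthetic” depends on the alphabetical folder ordering used by flow_from_directory, so it should be confirmed against train_batches.class_indices.

from tensorflow.keras.preprocessing import image
import numpy as np

img = image.load_img('test_face.png', target_size=IMAGE_SIZE) # placeholder file
arr = image.img_to_array(img)
arr = preprocess_input(np.expand_dims(arr, axis=0)) # same preprocessing as the generators

score = float(model.predict(arr)[0][0]) # sigmoid output in [0, 1]
# Check train_batches.class_indices to see which class corresponds to an output of 1.
print('score:', score)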

As mentioned in our previous articles, StyleGANs still have trouble with background warping, facial asymmetry, and hair-related details. Naturally, as skills and capabilities improve, it’s fully expected that future generative models, deepfake or otherwise synthetic, will address the shortcomings of existing approaches. While it may be up to researchers in the field to weigh the ethics and implications of the tools they create, society will need to play a role by learning to adopt a greater level of critical analysis in how we deal with information, in an age when seeing is no longer guaranteed to be believing.

We hope you enjoyed this article. To stay up to date with the latest updates to GradientCrescent, please consider following the publication. Next up, we move to a special long-term series on Reinforcement Learning. But before we dive into it, we’ll take another fun crack at an old favourite topic of ours.
