Image Forgery Localization (IFL) using UNET architecture

Image forgery localization is a technique for detecting manipulated regions in an image. Several algorithms already exist for image forgery detection, but this approach differs slightly from a typical forgery detector: a forgery detector only classifies whether a given image is manipulated or not, whereas forgery localization determines the manipulated region by segmenting the image into different regions. In this blog, we are going to implement IFL using the UNET architecture.

Santosh Kumar Pothabattula
Analytics Vidhya
May 13, 2020 · 13 min read


Cover picture (source)

Understanding the Problem

There is active research in digital forensics to gain confidence that a given piece of digital content is authentic. But why are researchers so concerned about digital forensics? Have a look at the following picture: what do you notice? It looks very natural, with nothing suspicious, right? But it is actually manipulated.

Figure1: An example of manipulated digital content that looks real (source)

Nowadays we cannot trust any digital content, as there is a very high chance that the content has been attacked, given the wide availability of digital tampering tools such as Photoshop, Adobe After Effects, and even some AI algorithms.

The following image (Figure2) shows the tampered regions in Figure1: the dark highlighted regions indicate where the image has been tampered with.

Figure2: An example of a mask image for the manipulated image (source)

For a human, it is almost impossible to determine every time whether a piece of digital content is pristine or manipulated. The most common methods of image tampering are copy-move, image splicing, and image inpainting. Copy-move is an attack where a small part of the image is copied and pasted into another region of the same image. Image inpainting removes some details from the image and smooths the removed regions with neighboring colors and properties so that a human cannot recognize that the image was attacked. Finally, image splicing selects a region from one particular image and pastes that region into another suitable image.

We need a proper algorithmic system that can detect attacked images. Having said that, detecting whether an image is manipulated or not is not sufficient on its own and does not give the complete solution; the algorithm should also localize the manipulated regions in the given image.

After understanding the above problem, our objective here is to build a model that, for a given image, is able to localize the tampered region.

Why Only with AI?

For a manipulated image, there will be a sudden change in statistical parameters at the tampered regions, especially for image-splicing attacks. Given that tampering can be detected through changes in statistical parameters, can we solve this problem with only traditional image processing techniques? Yes, we can, but it might require a lot of manual effort, a lot of research, and years of time. Let me explain this in detail.

For example, suppose my objective is to detect and eliminate the high frequencies in a given image. That is a very simple objective, right? We can achieve it by attenuating the high frequencies with a Gaussian filter with some mean and standard deviation. Note that for this objective we have a well-suited filter, but for our main objective (i.e. localizing the tampered region) there is no pre-defined filter function, and determining the perfect filter for such an objective might take several attempts, which definitely takes years of research and time.
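As a minimal illustration of that simpler objective, here is a hedged sketch of applying a hand-chosen Gaussian filter with OpenCV (the file name, kernel size, and sigma are arbitrary choices for illustration):

```python
import cv2

# Suppress high frequencies with a fixed, hand-picked Gaussian filter.
# "sample.jpg", the 5x5 kernel, and sigma=1.5 are illustrative choices only.
image = cv2.imread("sample.jpg")
smoothed = cv2.GaussianBlur(image, (5, 5), sigmaX=1.5)
cv2.imwrite("smoothed.jpg", smoothed)
```

The point is that the filter weights here are fixed by hand; nothing in this pipeline adapts itself to the task.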

Also, in the above example, the transfer function of the filter is the same for all kinds of images, so the weights of the filter never change as per our requirement. Whenever we have a complex objective and no idea which filter or combination of filters to use, engineers or researchers simply choose deep learning techniques.

Deep learning techniques like CNNs automatically learn the filter transfer functions required for our objective. During the training phase, a CNN continuously updates and learns the filter weights with respect to the difference between the obtained result and the actual result, and this is the core part of any artificial intelligence algorithm.

Data and EDA

Before going into further details of the deep learning architectures, we need suitable data for our objective. Here I am considering the IEEE IFS-TC Image Forensics Challenge[1] data, which contains both pristine and manipulated images. The manipulated images in the data were created using all three kinds of manipulation methods discussed above. Let us quickly grasp the data with some exploratory data analysis.

There are two classes of images, pristine and fake, where each fake image has a corresponding gray mask in which the dark color indicates the manipulated regions.

Figure3: Bar graph showing classes and their counts in the data

In the data, there are 1050 pristine images and 450 fake images; as said, these 450 fake images have their corresponding masks, as we can see in the following figure.

Figure4: A sample fake image and its grayscale mask.

The plots in Figure5 and Figure6 show the distributions of image widths and heights separately for both fake and pristine classes. Using these PDF plots, we can quickly grasp the typical image sizes and how likely they are to occur.

Figure5: PDF for the Widths of the images in both classes
Figure6: PDF for the Heights of the images in both classes

As we saw in Figure4, the mask images appear to be grayscale, meaning they should have only one color channel. However, some of the mask images have 3 channels and even 4 channels; Figure7 shows the bar plot of the number of mask images with 1 channel, 3 channels, and 4 channels.

Figure7: Bar plot indicating counts for the number of channels in the mask images.

In the multi-channel mask images, the channels after the first one simply add redundancy and do not add any further information, hence I considered only a single channel in every mask image. For some of the masks, a little noise was present; to attenuate this, I applied a simple Gaussian blur, which helps with smoothing. The result is what I consider the binary mask. Please have a look at Figure8.

Figure8: Image shows the difference between the given mask and smoothed binary mask
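A rough sketch of that mask clean-up with OpenCV (the kernel size and threshold value below are assumptions, not the exact settings used here) could look like:

```python
import cv2

def to_binary_mask(mask_path, blur_ksize=5):
    """Keep only the first channel of a mask, smooth its noise with a
    Gaussian blur, and threshold back to a clean 0/255 binary mask."""
    mask = cv2.imread(mask_path, cv2.IMREAD_UNCHANGED)
    if mask.ndim == 3:                # 3- or 4-channel masks: keep the first channel
        mask = mask[:, :, 0]
    mask = cv2.GaussianBlur(mask, (blur_ksize, blur_ksize), 0)
    _, binary = cv2.threshold(mask, 127, 255, cv2.THRESH_BINARY)
    return binary
```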

Out of these 450 fake images and their 450 corresponding masks, some masks may be defective. By defective I mean that if the dimensions of a fake image and its corresponding mask are not the same, then neither the fake image nor its mask is of any use, so such images and masks should be discarded. On checking this condition, we found only 8 defective masks, so we deleted those masks and their corresponding images.
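A simple way to run that check (fake_paths and mask_paths below are hypothetical, aligned lists of file paths) is to compare the spatial dimensions of each fake image with its mask:

```python
import cv2

defective = []
for fake_path, mask_path in zip(fake_paths, mask_paths):
    img = cv2.imread(fake_path)
    msk = cv2.imread(mask_path, cv2.IMREAD_GRAYSCALE)
    if img.shape[:2] != msk.shape[:2]:   # height/width mismatch => unusable pair
        defective.append((fake_path, mask_path))

print(len(defective), "defective mask/image pairs")   # 8 pairs in this dataset
```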

After performing EDA on the data, the quick conclusions are:

–We have a very small amount of data, only 1050 pristine images and 442 fake images (after data cleaning), and the data is imbalanced between the pristine and fake classes. We need to scale up the data for better results.

–From the PDF plots we observed that the range of image sizes for the fake class is somewhat larger than for the pristine class.

–Some masks have multi-channel information; realizing this, we considered only the first channel and converted every mask to binary.

Some existing approaches to our problem

Before considering our base model, let us understand some existing methods for our objective. There are some approaches based on Photo Response Non-Uniformity (PRNU)[2][3], which is unique to every camera, meaning that a photo captured by a particular camera is affected by a fixed pattern of noise from the camera's inbuilt image sensor.

If we compute the PRNU as a preprocessing step, it definitely adds a great advantage for the model, since the model can then differentiate two images captured with two different cameras. Using this characteristic, there are many good approaches to detect manipulated regions; however, they require information about the camera, and they work well mainly for image-splicing attacks, because in most splicing cases the spliced region and the target image come from different sources captured with different cameras, which we cannot guarantee for copy-move attacks. Since we do not have any camera information in our data, we cannot take advantage of PRNU.

There are some well-defined architectures like BusterNet[4] specially designed for image forgery localization tasks. However, BusterNet is designed only for copy-move attacks, while our data contains three types of attacks in total, and in the real world there can be even more. Hence we need a model that works on almost all kinds of attacks.

UNET as Base Model

As discussed earlier, our desired model should predict the mask for a given image, therefore our model should have an encoder-decoder kind of architecture. Since we are trying to highlight the tampered regions in the given image, the model needs to segment the image into two classes: one for the tampered region (dark color) and another for the untouched region (light color).

Here I am considering UNET[5] as the base model because it has already proven results for similar kinds of image segmentation and it meets the above requirements as well. UNET was first developed for biomedical image segmentation; the architecture contains two paths, one for the encoder, which contains a stack of convolution and max-pooling layers, and another for the decoder, which is symmetric to the encoder path.

Figure9: UNET architecture (source)
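To make the encoder-decoder structure concrete, here is a compact sketch of a UNET-style model in Keras; the layer widths and the 512 X 512 input size are assumptions chosen to match the preprocessing described later, not the exact configuration used in this project:

```python
from tensorflow.keras import layers, Model

def conv_block(x, filters):
    # Two 3x3 convolutions, as in the original UNET blocks
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x

def build_unet(input_shape=(512, 512, 3)):
    inputs = layers.Input(input_shape)
    # Encoder path: conv blocks followed by max-pooling, keeping skip outputs
    skips, x = [], inputs
    for f in (64, 128, 256, 512):
        x = conv_block(x, f)
        skips.append(x)
        x = layers.MaxPooling2D(2)(x)
    x = conv_block(x, 1024)   # bottleneck
    # Decoder path: upsample, concatenate the symmetric skip, then conv block
    for f, skip in zip((512, 256, 128, 64), reversed(skips)):
        x = layers.Conv2DTranspose(f, 2, strides=2, padding="same")(x)
        x = layers.concatenate([x, skip])
        x = conv_block(x, f)
    # One output channel with sigmoid: per-pixel probability of being untouched
    outputs = layers.Conv2D(1, 1, activation="sigmoid")(x)
    return Model(inputs, outputs)
```

The skip connections (the concatenate calls) are what let the decoder recover the fine spatial detail lost during pooling.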

From the EDA we realized that the data is very limited, and as we know, deep learning models are very data hungry; with this small amount of data we cannot build a reasonable model. To increase the impact of the data, I augmented it using a library called albumentations[6]. Why did I use a special library for augmenting the data? Because image augmentation should happen on both the image and its mask: for example, if an image is rotated 90 degrees as part of augmentation, then its corresponding mask should be rotated in the same way, and I found this library very useful for doing that.

Also, note that images were resized to 512 X 512 and all pixel values were divided by 255 as part of the preprocessing steps.

Figure10: Data augmentation using the Albumentations library
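A minimal sketch of such a paired image/mask augmentation pipeline (the specific transforms, probabilities, and file names below are assumptions for illustration) might look like:

```python
import albumentations as A
import cv2

# Geometric transforms are applied identically to the image and its mask
transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.RandomRotate90(p=0.5),
    A.Resize(512, 512),                  # resize both image and mask to 512 X 512
])

image = cv2.imread("fake_image.png")                      # hypothetical file names
mask = cv2.imread("fake_mask.png", cv2.IMREAD_GRAYSCALE)

augmented = transform(image=image, mask=mask)
aug_image = augmented["image"] / 255.0                    # scale pixels to [0, 1]
aug_mask = augmented["mask"]
```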

To avoid data leakage, I split the data into train, validation, and test sets, and after splitting I augmented the data separately for these three sets. Note that for fake images we have masks, but for pristine images we do not have any masks, hence for all pristine images I used default.mask.png as the default mask, which contains only white pixels and is the same for all pristine images.
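Creating that all-white default mask is a one-liner; the 512 X 512 size matches the preprocessing above and the file name follows the naming used here:

```python
import numpy as np
import cv2

# An all-white mask: every pixel counts as "untouched" for pristine images
default_mask = np.full((512, 512), 255, dtype=np.uint8)
cv2.imwrite("default.mask.png", default_mask)
```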

After splitting and augmenting, this data was used to train the UNET architecture, optimizing with Adam, reducing binary cross-entropy, and measuring accuracy. After running 10 epochs we got a good-looking accuracy of about 96.17% and a greatly reduced log loss, but the model could not make proper predictions, and the predicted masks are far from the ground truth masks, as you can see in Figure11 below.
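For reference, the training configuration just described amounts to roughly the following (assuming the build_unet sketch above and hypothetical generators train_gen / val_gen that yield batches of images and masks):

```python
# A sketch of the base-model training setup, not the exact script used
model = build_unet()
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_gen, validation_data=val_gen, epochs=10)
```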

Figure11: Comparison between the predicted masks from the base model and ground truth masks

Let us understand why this happened:

–The well-known issue is that accuracy can be deceiving when we have imbalanced data. We did balance our data while augmenting, but note that we balanced it between the pristine and fake classes.

–As per our objective, the model needs to predict, for each pixel in the image, whether it is real or manipulated: for each manipulated pixel the result should be 0, and for each untouched pixel the result should be 1.

–If we need to balance the data, we should balance between dark pixels and light pixels, where dark pixels indicate the manipulated region and bright pixels indicate the untouched region.

–Another way of saying this: in almost all images, the area of the manipulated region is very small compared with the untouched region. Because of this, white pixels become very dominant, which creates unbalanced classes, and our model could not learn properly.

–Hence our base model shows high accuracy but is not as effective as we expected.

To create an equal impact on the model while it learns to distinguish between dark and white pixels, the data should be balanced. In our case, however, we cannot increase the number of manipulated pixels, so the only option is to decrease or avoid the white pixels.

To downsample the white pixels, this time I completely excluded the pristine-class images, as their masks contain only white pixels.

Training UNET with only Fake images

In our next UNET model, we consider only fake images for training, avoiding all pristine images, and train with the same settings as the base model. This time we got an accuracy of about 93.34% after 10 epochs, but again the model could not make proper predictions as expected; however, these predictions are somewhat better than the first model's.

This tells us that even though we removed the pristine class entirely, the white pixels in the fake images still dominate. The data must be balanced further.

Training UNET with patches extracted from Fake images

Before going to the next model, since our data suffers from unbalanced classes, this time I segmented each fake image into patches of 128X128 pixels with a stride of 32, keeping only the patches that contain at least 25% manipulated pixels and at least 25% untouched pixels, and ignoring the rest. That way we do not get patches that are completely white.
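A rough sketch of that patch extraction (the function and variable names are my own; the mask is assumed to be a 0-255 grayscale array where dark pixels are manipulated):

```python
import numpy as np

def extract_patches(image, mask, patch=128, stride=32, min_frac=0.25):
    """Slide a patch x patch window with the given stride and keep only
    windows whose mask has at least min_frac manipulated (dark) pixels
    and at least min_frac untouched (bright) pixels."""
    kept = []
    h, w = mask.shape[:2]
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            m = mask[y:y + patch, x:x + patch]
            dark_frac = np.mean(m < 128)        # fraction of manipulated pixels
            if min_frac <= dark_frac <= 1 - min_frac:
                kept.append((image[y:y + patch, x:x + patch], m))
    return kept
```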

As you can see below, Figure12 shows a sample fake image and its mask image, and Figure13 shows the 128X128 patches extracted with stride 32. Note that we get many such patches for a single image, but due to space constraints I am displaying only a few of them.

Figure12: A sample fake image and its grayscale mask.
Figure13: A few patches (128X128) extracted from the sample shown in Figure12

After training for 10 epochs I got nearly 50% accuracy, which now reflects the true behavior of the model. Of course, we could further improve the accuracy on this patched data by experimenting with a different model, but to get the final outcome we need to combine all the predicted patches into one mask.

Since we eliminated most of the patches from each image, i.e. we kept only the patches containing both manipulated and authentic pixels, constructing the final mask from these predicted patches is not at all easy.

After experimenting with the above three methods we have learned:

–In our data, due to the small proportion of manipulated regions, our base models could not learn properly.

–In the predicted masks (in Figure11) we can see grid-like boxes; this is because none of the models was able to detect the objects themselves.

–Even though we extracted patches to balance the classes, it is still hard to construct the final mask from the predicted patches, hence we are not moving further with the patch approach.

–Since we have little data, it could be beneficial to use a model that was already pre-trained on a similar kind of data, so that we can fine-tune its weights with our data; that might work.

VGG16+UNET

Keeping all the above points in mind, for the next model we consider the following idea: first, the model should be able to understand the different objects in the image, and then, as a next step, it should segment those objects as manipulated or not.

To implement this idea, we can use one of the object detection networks for the first task, and for the next task we can attach a UNET that will act as the segmentation network.

Now I will try VGG16, pre-trained on ImageNet data, as the backbone that encodes the input image before sending it to the segmentation network (in our case the UNET architecture), so that we can train the combined architecture while keeping the VGG16 ImageNet weights unchanged.

Here I am using the "segmentation_models"[7] package, which provides the required utilities: from this package we can choose VGG16 or a similar network with ImageNet weights as the backbone and club it with segmentation models like UNET. Taking advantage of this, I trained the data on VGG16 (fixed with ImageNet weights) + UNET, and this time, since our data is imbalanced, I am measuring the f1-score, optimizing with Adam, reducing binary cross-entropy, and running for 10 epochs.
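A hedged sketch of that setup with segmentation_models (the 512 X 512 input matches the earlier preprocessing; train_gen / val_gen are assumed generators of image/mask batches, and the exact options here are illustrative):

```python
import segmentation_models as sm

sm.set_framework("tf.keras")

# VGG16 encoder pre-trained on ImageNet and kept frozen; the UNET-style decoder is trained
model = sm.Unet(
    backbone_name="vgg16",
    encoder_weights="imagenet",
    encoder_freeze=True,
    input_shape=(512, 512, 3),
    classes=1,
    activation="sigmoid",
)
model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=[sm.metrics.FScore()],   # f1-score, since the pixel classes are imbalanced
)
model.fit(train_gen, validation_data=val_gen, epochs=10)
```

Freezing the encoder (encoder_freeze=True) is what keeps the ImageNet weights unchanged while only the decoder learns from our masks.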

Finally, it gives a very good f1-score on the test data, about 0.9746, and our final model localizes the forgeries decently.

Figure14: Comparison between the predicted masks from the VGG16+UNET model and ground truth masks

As we can see in Figure14, our final model, i.e. VGG16+UNET, localizes the manipulated regions more effectively than the earlier models. Note that if we had a larger number of fake images in the data corpus, there would be a higher chance of better learning. Even though we trained with little data, this model did a decent job of localizing the forgeries.

Future Scope

The manipulated images we have seen in the data so far are of relatively high quality; for checking whether a low-quality image has been attacked, our approach may not be efficient. We can extend this approach by including one more model that has learned to convert low-quality images to high quality before sending them to our final model.

As of now, we have seen only three kinds of attacks as part of our case, but in the future plenty of manipulation techniques might evolve, created either by humans or by AI itself. Hence our future model should be agnostic to any kind of attack.
