The Future of Photo Editing, Style Transfer and Deep Learning

Subhajit Mandal
Aug 25, 2017

Over the last few years we have seen a lot of progress in the field of image processing using deep learning. In 2015, we saw a major step forward in style transfer, where a computer could redraw a picture in a specific artistic style. Then we saw how computers can learn to create realistic images on their own, with very little supervision, using deep convolutional generative adversarial networks. In 2016, we saw how a computer can fill colors into a drawing outline or a design. This year, we got Creatism from Google, which can automatically crop visually appealing scenes from spherical panoramas and apply filters to create professional-looking photographs. We also saw an advanced version of the neural style transfer algorithm from Adobe (deep photo style transfer).

The last two seem especially fascinating to me, so I thought I would try my hand at the deep photo style transfer algorithm to see how we can make use of it.

My Experiment

Deep Photo Style Transfer is about applying the style of one image onto another image while keeping the output looking realistic. For my experiment, I took two pictures, as described in the paper: one for the content and another for the style. After applying the style to the content, plus some manual editing, I got the final output. The overall process looked like the following.

We can see how the colors of the sky and the land from the style image have been beautifully applied onto the content image.

A Few Words on the Technicalities

Before we jump into further discoveries, I want to give a high-level overview of how this thing works, for those who are interested. You may skip this section if you are not that interested in the theory, or if you already know it.

So, how does it work? The answer: by minimizing a loss function, just like most other machine learning algorithms. The loss function here combines the dissimilarity of the output image with the content image and with the style image. To calculate these dissimilarities, the images are first converted into content features and style features using a well-known pre-trained neural network called VGG19. These features are taken from the feature maps at different layers of the network. Now let us see which types of losses are involved and how the total loss is defined.
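To make this concrete, here is a minimal sketch in PyTorch of how such features could be extracted. It uses torchvision’s pre-trained VGG19, but the particular layer choices are illustrative assumptions on my part, not necessarily the exact configuration from the paper, and the input is assumed to be a normalized 1 x 3 x H x W tensor.

```python
import torch
import torchvision.models as models

# Pre-trained VGG19, used only as a fixed feature extractor.
vgg = models.vgg19(pretrained=True).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

# Illustrative layer choices: one deep layer for content, several for style.
CONTENT_LAYERS = {21}               # conv4_2
STYLE_LAYERS = {0, 5, 10, 19, 28}   # conv1_1 ... conv5_1

def extract_features(image):
    """Run a (1 x 3 x H x W) image through VGG19 and collect the feature
    maps at the chosen content and style layers."""
    content_feats, style_feats = {}, {}
    x = image
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in CONTENT_LAYERS:
            content_feats[i] = x
        if i in STYLE_LAYERS:
            style_feats[i] = x
    return content_feats, style_feats
```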

Content Loss: This is just the plain squared Euclidean distance between the content features of the output image and those of the content image. As the name suggests, it attempts to keep the content of the output image similar to the content image.
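Building on the feature extractor above, a rough sketch of the content loss could look like this (summing over the chosen layers is an assumption of the sketch):

```python
import torch.nn.functional as F

def content_loss(output_feats, content_feats):
    """Squared Euclidean distance between the content features of the
    output image and of the content image, summed over the chosen layers."""
    return sum(F.mse_loss(output_feats[k], content_feats[k], reduction='sum')
               for k in content_feats)
```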

Style Loss: This one is tricky. How does one mathematically define the style of an image? For starters, we can define it as a set of correlations among different variations of the same image, obtained by applying different types of filters to it. This gives us a matrix, called the Gramian matrix. To measure the style loss, we take the squared differences between the Gramian matrices of the output image and of the style image. In our case, we use the style features from VGG19 instead of direct filter responses to calculate the Gramian matrix. Also, we compare only semantically similar areas of the pictures, e.g. the style of the sky is compared with that of the sky, not with that of the water. The semantic labeling can be done manually or automatically beforehand.
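In code, a sketch could look like the following. The Gramian (Gram) matrix computation is standard, but the mask data structure ({layer: {segment: soft mask resized to that layer’s feature map}}) is my own illustrative assumption:

```python
def gram_matrix(feat):
    """Gramian matrix of a feature map: correlations between its channels."""
    _, c, h, w = feat.shape
    f = feat.view(c, h * w)
    return (f @ f.t()) / (c * h * w)

def style_loss(output_feats, style_feats, output_masks, style_masks):
    """Squared difference between Gramian matrices, computed separately for
    each semantic segment (sky with sky, land with land, ...) and summed."""
    loss = 0.0
    for k in output_feats:
        for seg in output_masks[k]:
            g_out = gram_matrix(output_feats[k] * output_masks[k][seg])
            g_sty = gram_matrix(style_feats[k] * style_masks[k][seg])
            loss = loss + ((g_out - g_sty) ** 2).sum()
    return loss
```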

Total Variation Loss: Of course nobody likes a very noisy image. To tackle this, we try to minimize the total variation loss, which essentially instructs the optimizer that two adjacent pixels of the output image should not have drastically different values.
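A sketch of this term (the lack of normalization here is a simplifying assumption):

```python
def total_variation_loss(img):
    """Penalize large differences between horizontally and vertically
    adjacent pixels, which keeps the output image smooth and low-noise."""
    dh = (img[:, :, 1:, :] - img[:, :, :-1, :]).pow(2).sum()
    dw = (img[:, :, :, 1:] - img[:, :, :, :-1]).pow(2).sum()
    return dh + dw
```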

Affine Loss: This is the most important aspect of this method. It tries to ensure that, for any small area in the output image, the RGB values of the output pixels can be obtained from the content RGB values in the same area through a single affine transformation. This makes sure that the main structures of the content are preserved and that there are no distortions in the output.
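In the paper this is expressed through the matting Laplacian of the content image, and the loss is just a quadratic form per color channel. The sketch below assumes the matting Laplacian has already been computed (e.g. with closed-form matting) and is available as a sparse PyTorch matrix:

```python
import torch

def affine_loss(output_img, matting_laplacian):
    """Photorealism regularization: for each color channel c of the output
    image, compute v_c^T L v_c, where L is the (precomputed) matting
    Laplacian of the content image and v_c is the flattened channel."""
    loss = 0.0
    for c in range(3):
        v = output_img[0, c].reshape(-1, 1)  # (H*W, 1)
        loss = loss + (v.t() @ torch.sparse.mm(matting_laplacian, v)).squeeze()
    return loss
```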

A weighted combination of the above-mentioned loss types makes up the actual loss function. The machine gradually “paints” the image while minimizing this loss.
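Putting the pieces together, the combined loss could look like the sketch below. The weights are placeholder values for illustration, not the settings I actually used, and content_feats, style_feats, the segmentation masks and the matting Laplacian are assumed to have been prepared beforehand (the output image reuses the content image’s masks, since it keeps the content layout).

```python
# Placeholder weights for illustration; they need tuning per image pair.
ALPHA, BETA, GAMMA, LAMBDA = 1.0, 100.0, 1e-3, 1e4

def compute_loss(img, use_affine=True):
    """Weighted sum of the four loss terms for the current output image.
    content_feats, style_feats, content_masks, style_masks and
    matting_laplacian are assumed to come from the earlier snippets."""
    c_feats, s_feats = extract_features(img)
    loss = (ALPHA * content_loss(c_feats, content_feats)
            + BETA * style_loss(s_feats, style_feats, content_masks, style_masks)
            + GAMMA * total_variation_loss(img))
    if use_affine:
        loss = loss + LAMBDA * affine_loss(img, matting_laplacian)
    return loss
```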

Implementation and Results

Now let’s look at what I have done. First, I labelled the data manually for semantic segmentation. As discussed above, this segmentation helps prevent imposing the wrong style on the wrong segment (e.g. land style on the sky!). The segmentation looked like the following.

After that I carried out the image generation in two steps:

  1. Apply the neural style transfer without optimizing the affine loss.
  2. Apply the deep photo style transfer with the affine loss, using the image from step 1 as initialization (sketched in code below).
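Here is a rough sketch of how these two passes could be driven, reusing compute_loss from the previous snippet and assuming content_img is the preprocessed content image tensor; the choice of L-BFGS and the step counts are illustrative assumptions, not the exact settings I used.

```python
from torch import optim

def optimize(init_img, steps, use_affine):
    """One optimization pass: update the output image's pixels so that
    compute_loss (defined earlier) decreases."""
    img = init_img.detach().clone().requires_grad_(True)
    optimizer = optim.LBFGS([img], max_iter=steps)

    def closure():
        optimizer.zero_grad()
        loss = compute_loss(img, use_affine=use_affine)
        loss.backward()
        return loss

    optimizer.step(closure)
    return img.detach()

# Step 1: neural style transfer without the affine loss, starting from the content image.
stage1 = optimize(content_img, steps=500, use_affine=False)
# Step 2: deep photo style transfer with the affine loss, initialized with the step-1 result.
stage2 = optimize(stage1, steps=500, use_affine=True)
```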

In the following illustration we can see how the computer generates the output for step 1 (without the affine loss):

We can see that it resembles a painting rather than a realistic image; the distortions are what make it look this way. But when we repeat the process while accounting for the affine loss (the strength of the deep photo style transfer algorithm), the output looks quite different:

Now it looks much more like a realistic picture, although there are still some distortions in the color space. After this image was generated, I edited the original image by replacing the sky and the land with those from the output image. The final merged result looks like this:

Is it more beautiful than the original one? Is it surrealistic? I shall wait for your opinion on that. One thing is for sure: the sky and the land now look very similar to those in the style image, just as we expected.

Future: What Lies Ahead?

This could become a tremendously useful feature in photo editing. We would simply mark two areas in two pictures and tell the computer to transfer the style of one onto the content of the other with a few clicks. Today it takes significant effort to accomplish that kind of task. That would be a very powerful feature for photographers and designers, wouldn’t it?

However, there are currently technical barriers to implementing such a feature for ordinary users. For example, each iteration of the image generation process described above took around a minute to run on my machine, and I spent four days fine-tuning the model. My computer is not a high-end one, but I believe it is a fairly close representation of the most commonly used computers these days (4 cores, 8 GB RAM, no GPU). Now you can see the struggle an average user would face.

In my view, within five years most average users will have GPUs, so by then this may be a regular feature of Photoshop. In ten years, it may even be taken for granted. Let’s wait for the revolution and hope for the best!

Original Post: https://mandalsubhajit.wordpress.com/2017/08/13/the-future-of-photo-editing-style-transfer-and-deep-learning/
