Changing the season in a photo using neural networks

Maiya Rozhnova
Deelvin Machine Learning
12 min read · Aug 23, 2022

How to change the season in a photo? During the COVID-19 pandemic, applications for changing the background in photos and videos became popular. The algorithms in such applications use neural networks that can find the boundaries of an object in a photo and then overlay the cut-out object on a new background. However, the result is not always harmonious: you can easily end up with an image of a person in winter clothes sunbathing on a beach. On top of that, this approach does not work at all if there are no objects in the foreground.

Fig. 1. An example of processing a photo with an autumn preset

There are also ready-made sets of processing settings for photos, called presets. For example, an autumn preset shifts the colors in a photo from summer green to golden autumn (Fig. 1). The possibilities of such methods are limited, since they can only change certain parameters: color, contrast, brightness, and so on. Neural networks, however, can now address tasks of very different kinds, from recognizing cats in photos to powering talking assistants. In this article, we will test the use of neural networks for changing the season in a photo.

Image Style Transfer Using Convolutional Neural Networks

The Neural Style Transfer algorithm can be used to create an image with a chosen style. The goal of this algorithm is to find an image that best matches both the content of the first image (the input) and the style of the second image (the style image). As a result, the location of objects in the original photo is preserved, while the colors and textures are taken from the style image. The method is based on the pre-trained VGG-19 network, and its PyTorch implementation can be viewed here. For each pair of images, one can choose the optimal parameters of the loss function ℒtotal = αℒcontent + βℒstyle. Changing the α and β parameters allows one to focus either on the content of the original image or on the style. Content loss is defined as the squared error between the features of layer l of the original image p and the generated image x:
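\mathcal{L}_{content}(\vec{p}, \vec{x}, l) = \frac{1}{2} \sum_{i,j} \left( F^{l}_{ij} - P^{l}_{ij} \right)^{2}

where F^l and P^l are the feature maps of the generated image x and the original image p at layer l.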

Style loss is defined somewhat differently. The texture information of the style image is captured as correlations between image features, given by the Gram matrix:
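G^{l}_{ij} = \sum_{k} F^{l}_{ik} F^{l}_{jk}

i.e. the inner product between the vectorized feature maps i and j at layer l.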

To create an image with the specified style, it is necessary to minimize the mean-squared distance between the elements of the Gram matrices of the style image and the generated image:
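\mathcal{L}_{style}(\vec{a}, \vec{x}) = \sum_{l=0}^{L} \omega_{l} E_{l}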

where
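E_{l} = \frac{1}{4 N_{l}^{2} M_{l}^{2}} \sum_{i,j} \left( G^{l}_{ij} - A^{l}_{ij} \right)^{2}

with G^l and A^l the Gram matrices of the generated image and the style image at layer l, N_l the number of feature maps and M_l their size,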

and ωl are weight coefficients for each layer.
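To make the mechanics more concrete, here is a minimal PyTorch sketch of the content and style losses on top of torchvision's pre-trained VGG-19. The layer indices, weights, and helper names are illustrative assumptions rather than the parameters of the linked implementation.

import torch.nn.functional as F
from torchvision import models

# Pre-trained VGG-19 feature extractor, frozen while the image is optimized.
vgg = models.vgg19(weights=models.VGG19_Weights.DEFAULT).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def extract_features(img, layers=(1, 6, 11, 20, 29)):
    # Collect feature maps from the chosen VGG-19 layers (relu1_1 ... relu5_1).
    feats, x = [], img
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in layers:
            feats.append(x)
    return feats

def gram_matrix(feat):
    # Gram matrix of a (1, C, H, W) feature map, normalized by its size.
    _, c, h, w = feat.shape
    f = feat.view(c, h * w)
    return f @ f.t() / (c * h * w)

def total_loss(generated, content, style, alpha=1.0, beta=1e6):
    gen_f, content_f, style_f = (extract_features(t) for t in (generated, content, style))
    # Content loss: squared error between feature maps of one deep layer.
    content_loss = F.mse_loss(gen_f[-2], content_f[-2])
    # Style loss: squared error between Gram matrices, summed over the chosen layers.
    style_loss = sum(F.mse_loss(gram_matrix(g), gram_matrix(s))
                     for g, s in zip(gen_f, style_f))
    return alpha * content_loss + beta * style_loss

The generated image is then optimized directly (for example with L-BFGS or Adam) to minimize this total loss while the VGG weights stay fixed.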

This method of style transfer gives interesting and unique results when works of art are used as style images: paintings by Vincent van Gogh, Pablo Picasso, Edvard Munch and other famous artists. But how about using photos of a different season instead? Let’s try to change a winter landscape to a summer one. Figure 2 shows an example of how this method works, when the input image is my winter photo, the style image is a summer landscape, and the output is an image in the colors of the style image. This example clearly demonstrates that the original image gets completely recolored: the network does not pay attention to whether there is a person in the photo or not.

Fig. 2. Example of Image Style Transfer algorithm with different style images

If we instead take a style image that is similar in content to the original image, the output shows an attempt to repaint the snow in a different color (Fig. 3). By changing the parameters of the loss function, we might get a less white background, but in any case this method cannot pick out the objects in the original image that need to be recolored. Moreover, the content loss will not allow removing or replacing objects in the image, for example changing a winter hat to a cap.

Fig. 3. Example of Image Style Transfer algorithm with similar style images

Image Style Transfer Using GAN (Generative Adversarial Networks)

The Generative Adversarial Network (GAN) can be used to create artificial images similar to the training dataset. GAN includes a generator (G) of new images and a discriminator (classifier) (D) that determines whether the image is real or fake (Fig. 4).

Fig. 4. GAN structure

The goal of training the generator is to minimize the function log(1−D(G(z))), while the goal of training the discriminator is to increase the probability of assigning the correct class both to real images and to artificial images produced by the generator. The training objective is therefore a minimax game between two players, the generator and the discriminator.
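In the notation of the original GAN paper, this objective can be written as:

\min_{G} \max_{D} V(D, G) = \mathbb{E}_{x \sim p_{data}(x)} [\log D(x)] + \mathbb{E}_{z \sim p_{z}(z)} [\log(1 - D(G(z)))]

where x are real images and z is the noise vector fed to the generator.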

The image-to-image translation task is the task of converting a given representation of a scene into another one, for example, a gray image into a color image, an apple into an orange, or, what is relevant for our topic, the transformation of winter into summer and vice versa. In the following section we will consider several examples of GAN-based networks that solve the problem of changing the season in a photo.

CycleGAN

Fig. 5. An example of the work of CycleGAN from the authors of the model (upper row: winter → summer, bottom row: summer → winter)

Pre-trained models are available for CycleGAN for the following transformations: summer ↔ winter (Fig. 5), zebra ↔ horse, apple ↔ orange, Monet painting ↔ photo, and others. CycleGAN is based on the “pix2pix” concept, which uses a conditional generative adversarial network. However, while pix2pix is trained on paired data, CycleGAN addresses the task of unpaired image-to-image translation, since not every task has paired images in the training sample. In the context of the season change task, training CycleGAN does not require winter and summer images of the same scene; it is enough to have a set of summer images and a set of winter images with different scenes (Fig. 6). CycleGAN learns to extract features from one set of images (summer) and to find a way of translating them into the other set of images (winter).

Fig. 6. An example of a paired and unpaired dataset

CycleGAN gets its name from the fact that the training process minimizes a “cycle consistency loss”. If X is a set of summer images, Y is a set of winter images, G is a transformation from summer to winter, and F is the inverse transformation from winter to summer, then the loss for our task is constructed as follows (Fig. 7): we take an image x from X, apply the transformation G, take the result and apply the inverse transformation F to it; after this double transformation (summer → winter → summer) we should get an element similar to the original x.

Fig. 7. a) Scheme of interaction of the generators G, F and the discriminators DX, DY; b) cycle-consistency loss structure for images from set X; c) cycle-consistency loss structure for images from set Y

The difference between x and F(G(x)), as well as the similar difference between y and G(F(y)) is the required loss:
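\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x \sim p_{data}(x)} \left[ \lVert F(G(x)) - x \rVert_{1} \right] + \mathbb{E}_{y \sim p_{data}(y)} \left[ \lVert G(F(y)) - y \rVert_{1} \right]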

Also, the training goal is to minimize the losses of the two discriminators DX and DY, which classify the result of the transformations F and G respectively:
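\mathcal{L}_{GAN}(G, D_{Y}, X, Y) = \mathbb{E}_{y \sim p_{data}(y)} [\log D_{Y}(y)] + \mathbb{E}_{x \sim p_{data}(x)} [\log(1 - D_{Y}(G(x)))]

with the symmetric term \mathcal{L}_{GAN}(F, D_{X}, Y, X) defined for F and D_{X}.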

So we get the target loss:
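\mathcal{L}(G, F, D_{X}, D_{Y}) = \mathcal{L}_{GAN}(G, D_{Y}, X, Y) + \mathcal{L}_{GAN}(F, D_{X}, Y, X) + \lambda \mathcal{L}_{cyc}(G, F)

where the coefficient λ controls the relative importance of the cycle consistency term.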

Contrastive Learning for Unpaired Image-to-Image Translation

The authors did not stop at CycleGAN and proposed a faster and less memory-intensive model. Its training time is reduced because the mapping is learned in one direction only, so no resources are spent on an inverse generator and discriminator. Another advantage of the model is the ability to train on a dataset consisting of a single image.

In this case, the task of unpaired image-to-image translation is solved using patchwise contrastive learning (Fig. 8) and the InfoNCE loss function, which pulls together the corresponding patches of a pair of images (the horse head and the zebra head in Fig. 8) and pushes them away from negative patches (the yellow patches in Fig. 8). This is required so that the generated image is realistic and so that corresponding patches in the input and output images share content (parts or shapes of objects) while remaining invariant to color and texture.

Fig. 8. Patchwise contrastive learning structure

The PatchNCE loss function can be described as follows. During training, features are extracted from a pair of images x and y (Fig. 9). In the second image, a query patch is selected and compared with the corresponding patch in the first image. In addition, N negative patches are selected in the first image, and the query patch from the second image is classified against the set of N + 1 elements (N negatives plus the positive patch). As a result, the zebra-head patch should be more strongly associated with the horse's head than with the rest of the patches in the first photo.
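In the CUT paper this (N + 1)-way classification is expressed as a cross-entropy loss over scaled dot products of the patch features:

\ell(v, v^{+}, v^{-}) = -\log \left[ \frac{\exp(v \cdot v^{+} / \tau)}{\exp(v \cdot v^{+} / \tau) + \sum_{n=1}^{N} \exp(v \cdot v^{-}_{n} / \tau)} \right]

where v is the query patch feature, v⁺ is the positive (corresponding) patch, v⁻_n are the negative patches, and τ is a temperature parameter.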

Fig. 9. Patchwise Contrastive Loss description

The final loss function consists of a combination of the loss function for GAN and PatchNCE losses for two sets of images X and Y:
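\mathcal{L} = \mathcal{L}_{GAN}(G, D, X, Y) + \lambda_{X} \mathcal{L}_{PatchNCE}(G, H, X) + \lambda_{Y} \mathcal{L}_{PatchNCE}(G, H, Y)

where H is the small MLP that maps the encoder features into the embedding space in which the contrastive loss is computed.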

The values λX = 1, λY = 1 correspond to mutual learning with identity loss. This combination of parameters is called Contrastive Unpaired Translation (CUT) and demonstrates good performance. Learning without identity loss (λX = 10, λY = 0) is called FastCUT and it can be considered a simpler and faster version of CycleGAN. This method also allows one to reproduce the texture of a given photo while maintaining the structure of the input image, which is similar to the Neural Style Transfer method described at the beginning of the article.

MSGAN: Generative Adversarial Networks for Image Seasonal Style Transfer

The task of changing the season is non-trivial and is complicated by the fact that it involves more than changing the main palette of colors from green (summer) to white (winter). It also requires adding/removing certain objects or changing their shape without spoiling objects that should not be changed in any way. For example, a snowman should not be present in the summer image, but the house should be the same color in summer and winter images.

The MSGAN model was proposed to address these challenges; it is capable of generating all four seasons for an input image. A set of images of different seasons is used for training; paired data are not required. The MSGAN architecture consists of two generators and three discriminators and is shown in Fig. 10.

Fig. 10. MSGAN Architecture

First, the input image is converted from RGB to grayscale, so that the season of the original picture does not need to be taken into account, which simplifies the discriminator. The generator then transforms the original image into images with seasonal styles, learning the global features of the image to do so. The generator consists of an encoder and a decoder, symmetrical in structure, and seven ResNet blocks. Residual blocks preserve the size and shape of the previous network layer and pass it on to the next layer, which can improve network performance and prevent vanishing gradients during training.

MSGAN uses two types of discriminators. One of them is binary and answers the question “is the image real or artificial?”, just like in the standard GAN structure described above. The second type of discriminator (Ds) determines the season of the generated image; it is an AlexNet classifier with a Softmax activation function in the last layer. The style loss function Lstyle is defined as follows:

If the season is the same, DS outputs 0, otherwise DS outputs 1.

The task of the generator is to create an image G(x|c) from an input image x given season c and to ‘fool’ the seasonal discriminator Ds. Therefore, the input image is transformed from one style to another based on the auxiliary information c and the CGAN (Conditional GAN) loss is described as follows:

In this case, instead of the negative logarithmic loss function, the least square loss function is used.

MSGAN, like CycleGAN, also uses cycle consistency loss (Lcyc) to match the input image x and the result G(F(x)) when using double season transformation.

When changing the season, the structural similarity of the content of the original and generated image should be preserved as much as possible. The authors use SSIM loss with a filtering window size of 13:
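\mathcal{L}_{SSIM}(x, G(x)) = \frac{1}{N} \sum_{p=1}^{N} \left( 1 - SSIM(p) \right)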

where N is the number of pixels p in windows x and G(x).

Also, when the season changes, the color scheme of the image changes significantly. For this loss (computed with a sliding 13×13 window), instead of the RGB space, the H (hue) component of the HSV color space is used, since hue is closer to human perception of color.

In this work the authors took into account that, when changing the season, it is not necessary to apply the same transformation to all image regions. The image is segmented using the algorithm of Hou et al., and for important segments the following loss is used:

and for other segments, loss with different weights is used:

The result of the algorithm with the selection of important segments is shown in Fig. 11a and without using information on importance in Fig. 11b. In the second image, along with the tree and grass, the bench turned green.

Fig. 11. Image generated a) with the selection of important segments b) without the selection of important segments

The authors also compared the performance of MSGAN with the results of other networks (Fig. 12). Among the shortcomings, they noted that CycleGAN generates only one style rather than four; CGAN produces noisy images; and UMGAN retains the structure of the image, but the hue of the whole image is shifted toward the dominant color of the season (especially noticeable in the color of the sky), since UMGAN was originally designed for transferring the style of underwater images.

Fig. 12. Comparison of the results of MSGAN, CGAN, CycleGAN and UMGAN

CycleGAN example

Training GANs is a rather time- and resource-consuming process, so to demonstrate the operation of this type of network we will use the PyTorch implementation of CycleGAN and the pre-trained summer2winter_yosemite.pth and winter2summer_yosemite.pth models (these and other available weights can be downloaded using a script from the same GitHub repository). The results of CycleGAN for transforming summer into winter are shown in Fig. 13. The fake winter images are either gray or reddish, although for some landscapes the results are harmonious. Also, the network does not remove objects that cannot be present in a winter photo (flowers, tree leaves).

Fig. 13. An example of CycleGAN operation transforming summer → winter

Fig. 14 shows the results of the reverse transformation from winter to summer. As an output, we again get red shades and there is too much white snow left. We also see repainted objects that should have remained the same color (red berries, a person’s face).

Fig. 14. An example of CycleGAN operation transforming winter → summer

To run CycleGAN on my images, I followed these instructions and ran the test.py script with the following parameters:

python3 test.py --dataroot datasets/summer2winter_yosemite --name summer2winter_pretrained --model cycle_gan --results_dir results_cycle_gan --no_dropout

It is important to prepare one's test dataset in advance by placing images in the folders test_A and test_B (A for summer, B for winter), and also to download the pre-trained models to the checkpoints folder.

After trying several approaches to the task of changing the season in a photo, the following conclusions can be drawn. The convolutional-neural-network-based style transfer model is less effective than the GAN-based ones and requires two images as input (a content image and a style image). The MSGAN model is capable of recoloring individual regions of a photo and generating photos of all four seasons from a single input image. CycleGAN performs poorly on photos with a person, because it recolors the entire photo, but for some landscapes it can still produce acceptable quality.

I also invite you to read other articles in our Deelvin Machine Learning Blog.
