Paper Review: Reference-Based Sketch Image Colorization using Augmented-Self Reference and Dense Semantic Correspondence
All figures and tables are from the paper unless otherwise noted (it is marked when one comes from another paper or website).
Content
- Abstract
- Method
- Result and Experiments
- My Opinion
1. Abstract
This paper was accepted at CVPR 2020.
The authors note that colorization has been successful on grayscale images, but sketch or outline images remain challenging because they contain no pixel intensity information.
The two commonly used remedies for this problem are user hints and reference images.
However, in the reference-based case, progress has been slow due to scarce datasets and the information discrepancy between the sketch and the reference.
The authors therefore address these problems in two ways:
- we utilize an augmented-self reference which is generated from the original image by both color perturbation and geometric distortion. This reference contains the most of the contents from original image itself, thereby providing a full information of correspondence for the sketch, which is also from the same original image
- our model explicitly transfers the contextual representations obtained from the reference into the spatially corresponding positions of the sketch by the attention-based pixel-wise feature transfer module, which we term the spatially corresponding feature transfer (SCFT) module
The authors argue that the above two methods can optimize the network without manually annotated labels.
Currently (2020-07-29), the official code for this model has not been released yet.
2. Method
2–1. Overall Workflow
As illustrated in Fig. 3, I is a source color image, I_s is a sketch image extracted with an outline extractor, and I_r is a reference image obtained by applying a thin-plate-spline (TPS) transformation. The model takes I_s and I_r and extracts activation maps f_s and f_r using two independent encoders, E_s(I_s) and E_r(I_r).
To transfer information from I_r to I_s, the model uses the SCFT module, inspired by the self-attention mechanism. SCFT calculates dense correspondences between all pixels of I_r and I_s. Based on the visual mapping obtained from SCFT, context features combining information from I_r and I_s are passed through the rest of the model to produce the final colored output.
2–2. Augmented-Self Reference Generation
An appearance transformation and a spatial transformation are applied to generate I_r from I. The authors argue that since I_r is generated from I itself, it is guaranteed to contain information useful for colorizing I_s.
Appearance transform a(·): adds particular random noise to each RGB pixel. This is done to prevent the model from memorizing a color bias (e.g., apple → red). In addition, the authors argue that providing a different reference at each iteration forces the model to utilize both E_s and E_r. Here, a(I) is used as the ground truth I_gt.
TPS transformation s(·): after the appearance transform, a non-linear spatial transformation operator is applied to a(I). The authors state that this prevents the model from lazily copying colors from the same pixel positions of I, instead forcing it to identify semantically meaningful spatial correspondences even for a reference image with a spatially different layout, e.g., a different pose.
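As a rough illustration, the appearance transform can be sketched as adding a random per-channel offset to the RGB image. The exact perturbation and its magnitude below are my assumptions, not the paper's, and the TPS warp s(·) that would follow is omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def appearance_transform(img, noise_scale=30.0):
    """Rough sketch of a(.): add a random per-channel offset to an RGB image.
    The actual noise distribution/magnitude in the paper may differ."""
    noise = rng.uniform(-noise_scale, noise_scale, size=(1, 1, 3))
    return np.clip(img.astype(np.float64) + noise, 0, 255).astype(np.uint8)

I = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)  # stand-in color image
I_gt = appearance_transform(I)   # a(I) doubles as the ground truth I_gt
# I_r would then be s(a(I)): a TPS warp of I_gt (warping omitted in this sketch).
print(I_gt.shape)
```

Because the noise is resampled on every call, each training iteration would see a differently colored reference, which is the point the authors make about preventing color memorization.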
2–3. Spatially Corresponding Feature Transfer
The purpose of the SCFT module, as claimed by the authors, is as follows.
- Learning where to get information from a reference.
- Learning what part of the sketch image should be transferred to.
First, the two encoders E_r and E_s each consist of L conv layers that produce activation maps. These maps are down-sampled to the same spatial size as the final map f^L and concatenated along the channel axis. The final activation map V is therefore as follows.
Here, “;” denotes the channel-wise concatenation operator.
The authors argue that this lets the model capture low- and high-level features simultaneously. V can then be reshaped as follows:
Because of the heavy notation in this part, I think it is better to present the equations as images rendered from a LaTeX editor (Medium does not support LaTeX).
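Even without the exact equations, the core of SCFT is scaled dot-product attention between sketch and reference features. A minimal numpy sketch, with illustrative dimensions and random matrices standing in for the learned projections W_q, W_k, W_v:

```python
import numpy as np

rng = np.random.default_rng(0)

d_v, hw = 32, 64          # illustrative: channel dim of V and number of spatial positions
v_s = rng.standard_normal((hw, d_v))   # reshaped sketch features (one row per pixel)
v_r = rng.standard_normal((hw, d_v))   # reshaped reference features

# Learned projections replaced by random matrices in this sketch.
W_q = rng.standard_normal((d_v, d_v))
W_k = rng.standard_normal((d_v, d_v))
W_v = rng.standard_normal((d_v, d_v))

q, k, v = v_s @ W_q, v_r @ W_k, v_r @ W_v
scores = q @ k.T / np.sqrt(d_v)                          # pairwise affinities w_ij
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)                   # softmax over reference pixels

c = attn @ v      # context transferred to each sketch pixel position
print(c.shape)
```

Each row of `attn` is the dense correspondence of one sketch pixel over all reference pixels, which is exactly the "where to get information from" mapping the authors describe.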
2–4. Objective Functions
Similarity-Based Triplet Loss.
The authors argue that, because the spatial transformation is known, they can obtain full information about the weight w_ij, which represents how much the i-th pixel position of the input image (a query) is related to the j-th pixel position of the output (a key). The value of w_ij can thus be considered a pixel-to-pixel association. Using this pixel-level correspondence information, the authors propose a similarity-based triplet loss to directly supervise the affinity between the pixel-wise query and key vectors used to compute the attention map.
Here, S(·, ·) is a scaled dot product, and γ is the margin indicating the minimum distance that S(v_q, v^p_k) and S(v_q, v^n_k) should maintain. L_tr encourages the query representation to be close to the correct (positive) key representation while pushing it far from the wrong (negatively sampled) one.
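A minimal sketch of such a similarity-based triplet loss, assuming a scaled dot product for S(·, ·); the vector dimension is illustrative:

```python
import numpy as np

def scaled_dot(a, b):
    """S(a, b): dot product scaled by sqrt of the dimension."""
    return float(a @ b) / np.sqrt(a.shape[0])

def triplet_loss(v_q, v_k_pos, v_k_neg, gamma=1.0):
    """Hinge on the similarity gap: push S(q, pos) above S(q, neg) by margin gamma."""
    return max(0.0, scaled_dot(v_q, v_k_neg) - scaled_dot(v_q, v_k_pos) + gamma)

rng = np.random.default_rng(0)
v_q = rng.standard_normal(16)
# Positive key aligned with the query, negative key pointing the opposite way:
loss_easy = triplet_loss(v_q, v_k_pos=v_q, v_k_neg=-v_q)
print(loss_easy)
```

When positive and negative keys are indistinguishable, the loss degenerates to the margin γ, which is the supervision signal that forces the attention map to separate them.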
L1 Loss.
Since the ground truth image I_gt already exists (see 2–2), a reconstruction loss that penalizes differences between I_gt and the output can be calculated as follows.
Adversarial Loss.
This paper adopts the adversarial loss of a conditional GAN. The authors note that it is important to preserve the input sketch I_s, so I_s is used as the condition. The adversarial loss is therefore calculated as:
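For illustration, the vanilla (sigmoid cross-entropy) GAN objective can be sketched as below; whether the paper uses exactly this formulation or a variant (e.g., LSGAN) is not restated here, so treat this as a generic sketch in which the discriminator would see (I_s, image) pairs:

```python
import numpy as np

def d_loss(d_real, d_fake):
    """Discriminator side of the vanilla cGAN objective (logits -> sigmoid BCE).
    d_real/d_fake are D's logits on real and generated images, conditioned on I_s."""
    p_real = 1.0 / (1.0 + np.exp(-d_real))
    p_fake = 1.0 / (1.0 + np.exp(-d_fake))
    return float(-np.mean(np.log(p_real) + np.log(1.0 - p_fake)))

def g_loss(d_fake):
    """Generator tries to make D label the conditioned fake as real."""
    p_fake = 1.0 / (1.0 + np.exp(-d_fake))
    return float(-np.mean(np.log(p_fake)))

# A confident, correct discriminator yields a low discriminator loss:
print(d_loss(np.array([5.0]), np.array([-5.0])))
```

An undecided discriminator (logits of 0 for both real and fake) gives exactly 2·log 2, which is a handy sanity check when debugging GAN training.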
Perceptual Loss.
The perceptual loss is said to make the output produced by the network more plausible. The authors employ a perceptual loss over multi-layer activation maps to reflect not only high-level semantics but also low-level styles, as follows.
Φ_l represents the activation map of the l-th layer, extracted at the relu l_1 layer of the VGG19 network.
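A hedged sketch of a multi-layer perceptual loss, with random arrays standing in for the VGG19 activations Φ_l (the channel counts are typical VGG sizes, not a claim about the paper's exact layer choice):

```python
import numpy as np

def perceptual_loss(feats_out, feats_gt):
    """Average per-layer L1 distance between activation maps Phi_l of the
    output and the ground truth (VGG19 replaced by placeholder arrays)."""
    return float(np.mean([np.mean(np.abs(a - b)) for a, b in zip(feats_out, feats_gt)]))

rng = np.random.default_rng(0)
# Stand-ins for activations at several VGG19 layers (C x H x W).
feats_gt = [rng.standard_normal((c, 8, 8)) for c in (64, 128, 256)]
feats_out = [f + 0.1 for f in feats_gt]   # output whose features are slightly off
print(perceptual_loss(feats_out, feats_gt))
```

In practice the features would come from a frozen VGG19 applied to the generated image and to I_gt; only the distance computation is shown here.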
Style Loss.
For a given Φ_l ∈ R^(C_l×H_l×W_l), the style loss, which narrows the difference in covariance between activation maps, is calculated as follows.
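Style losses of this kind are commonly instantiated with Gram matrices (uncentered second-order statistics) of the activation maps; a minimal sketch under that assumption:

```python
import numpy as np

def gram(phi):
    """Gram matrix of an activation map Phi_l in R^(C x H x W)."""
    c, h, w = phi.shape
    f = phi.reshape(c, h * w)
    return f @ f.T / (c * h * w)

def style_loss(feats_out, feats_gt):
    """Match second-order statistics between output and ground-truth activations."""
    return float(np.mean([np.mean((gram(a) - gram(b)) ** 2)
                          for a, b in zip(feats_out, feats_gt)]))

rng = np.random.default_rng(0)
phi = rng.standard_normal((16, 8, 8))
print(style_loss([phi], [phi]))  # identical features -> 0.0
```

The Gram matrix discards spatial layout and keeps only channel co-activation statistics, which is why this term captures "style" rather than content.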
Final Loss.
The final loss function is as follows:
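The final objective is a weighted sum of the individual terms; the λ values below are purely illustrative placeholders, not the paper's actual weights:

```python
# Hypothetical loss weights (lambda values are illustrative, not from the paper).
lambdas = {"trip": 1.0, "rec": 30.0, "adv": 1.0, "perc": 0.01, "style": 50.0}
# Hypothetical per-term loss values from one training step.
losses = {"trip": 0.2, "rec": 0.05, "adv": 0.7, "perc": 1.3, "style": 0.004}

# L_total = sum_k lambda_k * L_k
total = sum(lambdas[k] * losses[k] for k in lambdas)
print(total)
```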
3. Result and Experiments
Data.
The datasets used in the paper are: Tag2pix, the Yumi dataset, SPair-71k, ImageNet, a human face dataset, and Edges→Shoes.
For the description of each data, please refer to the article and link.
Experiments
Compared with the other baselines in Fig. 6 and Fig. 7, the results of this paper look better than the baseline results.
In Table 1, which reports FID scores, the authors' method shows the best performance.
As shown in Fig. 8 and Table 2, performance improves as each loss term is added.
There are many other user studies, figures, and tables, so if you are interested, please refer to the appendix of the paper.
4. My Opinion
Personally, I am very surprised that a dataset like Tag2pix works well. Anime datasets are generally scarce, and there are many variables such as poses, viewpoints, and body proportions, so I assumed learning would be difficult. Even considering that this is a specific sub-field of colorization, it is impressive to see that it works so well.