Towards Transferable Targeted Attack

Zoran Kolega
Machine Intelligence and Deep Learning
9 min read · Apr 16, 2022

A high-level presentation of the CVPR 2020 paper “Towards Transferable Targeted Attack”, authored by Maosen Li, Cheng Deng, Tengjiao Li, Junchi Yan, Xinbo Gao, and Heng Huang.

Presentation Link: https://youtu.be/PQKXUNJiQOM?t=1523

Overview

Goals

The goal of this paper is to design an adversarial attack method for deep neural networks that can be used to evaluate and improve their robustness.

The attack has two main features which aid in accomplishing these goals:

  • The attack preserves or improves the transferability of existing attacks.
  • The attack remains targeted even in black-box settings.

Problems with Existing Attacks

There are two main flaws with existing adversarial attacks that this paper aims to address:

  1. Existing attacks suffer from noise curing.
  2. Existing attacks do not require that the classification be far away from the true class.

The paper provides two key contributions to address these issues.

  1. Introduction of a Poincare distance metric to address noise curing.
  2. A triplet loss to promote classification farther from the true class.

Concepts

There are several concepts which are helpful to know to better understand this paper.

Logits

A logit is a simple formula:

logit(x) = log(x / (1 - x))

It is the inverse of the sigmoid function. It takes values in the open interval (0, 1) as input and maps them onto the range (-∞, ∞).

The graph of logit(x) = log(x / (1 - x)).
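
As a quick sanity check, the relationship between the sigmoid and the logit can be verified with a few lines of NumPy (a small illustration of the formula above, not code from the paper):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def logit(p):
        # Maps a probability p in (0, 1) onto the whole real line.
        return np.log(p / (1.0 - p))

    p = np.array([0.1, 0.5, 0.9])
    print(logit(p))           # [-2.197  0.     2.197]
    print(sigmoid(logit(p)))  # recovers [0.1 0.5 0.9]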

Norms

Norms are used to calculate the magnitude of a vector; applied to the difference of two vectors, they serve as distance metrics. This paper includes three different norms:

L1 norm : Manhattan distance

L2 norm : Euclidean distance

L-∞ norm: the maximum absolute value among the vector's elements
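
For concreteness, the three norms can be computed with NumPy as follows (a small illustration, not code from the paper):

    import numpy as np

    v = np.array([3.0, -4.0, 1.0])

    l1 = np.sum(np.abs(v))        # L1 norm (Manhattan): 8.0
    l2 = np.sqrt(np.sum(v ** 2))  # L2 norm (Euclidean): ~5.10
    linf = np.max(np.abs(v))      # L-∞ norm (largest absolute entry): 4.0

    # The same values via NumPy's built-in norm:
    print(np.linalg.norm(v, 1), np.linalg.norm(v, 2), np.linalg.norm(v, np.inf))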

Poincare Ball

Several visualizations of the geometry of the Poincare ball [1].

The Poincare ball is an example of non-Euclidean geometry. In (b), the Euclidean length of each line segment in the ball is the same. The Poincare distance from any interior point of the ball to its edge is infinite. Figure (c) shows that the Poincare distance grows exponentially larger close to the surface of the ball.

The important takeaway is that, in the geometry of the Poincare ball, distances between points grow larger as the points move closer to the surface of the ball.

Hyperparameters

These are the user-specified parameters of the attack, together with the notation used throughout the paper.

K : the number of classifiers in the ensemble

𝜖 : the maximum perturbation allowed for each pixel of an image

μ : the momentum decay factor

T : the number of iterations

𝑤 = [𝑤₁, 𝑤₂, … , 𝑤ₖ] : the ensemble weights

x : a clean, unmodified image

Existing Methods

Fast Gradient Sign Method (FGSM)

Formula to generate an adversarial example using the fast gradient sign method [1].

The fast gradient sign method generates adversarial examples by perturbing the image in the direction of the gradient of the cost function with respect to the input. This maximizes the error of the classifier.
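
A minimal PyTorch sketch of this one-step update is shown below; model, loss_fn, and the [0, 1] pixel range are my own assumptions, not the authors' code.

    import torch

    def fgsm(model, loss_fn, x, y_true, eps):
        # One-step FGSM: perturb x in the direction of the sign of the
        # input gradient to increase the loss on the true label.
        x = x.clone().detach().requires_grad_(True)
        loss = loss_fn(model(x), y_true)
        loss.backward()
        x_adv = x + eps * x.grad.sign()
        return x_adv.clamp(0, 1).detach()  # keep pixels in a valid range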

Targeted Fast Gradient Sign Method (Targeted FGSM)

A modified version of the fast gradient sign method that generates a target adversarial example [1].

The fast gradient sign method can be modified to produce targeted adversarial examples. Instead of maximizing the error, this targeted version of FGSM creates noise that reduces the loss with respect to the target class.

This results in an adversarial example which not only produces a misclassification, but one that is targeted at a specific class.
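
In code, the targeted variant differs from the sketch above only in the label it uses and the sign of the step (again an illustration with assumed names, not the authors' code):

    import torch

    def targeted_fgsm(model, loss_fn, x, y_target, eps):
        # Targeted FGSM: step against the gradient to reduce the loss
        # with respect to the chosen target label.
        x = x.clone().detach().requires_grad_(True)
        loss = loss_fn(model(x), y_target)
        loss.backward()
        x_adv = x - eps * x.grad.sign()  # note the minus sign
        return x_adv.clamp(0, 1).detach()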

Momentum-Iterative Fast Gradient Sign Method (MI-FGSM)

Formulas to generate an adversarial example using the momentum-iterative fast gradient sign method [1].

The momentum-iterative fast gradient sign method aims to improve the transferability and smooth the noise output of the fast gradient sign method.

MI-FGSM improves transferability by using momentum to escape poor local minima during iterations [2]. This is desirable to make the attack more effective in a black-box attack, where network architecture is unknown.

The smoother noise output is due to the more stable update direction that the momentum provides. This has the effect of producing adversarial examples whose noise is less perceptible to humans.

The momentum term g is calculated using the normalized gradient of the cost function and the decay factor μ.

The adversarial example is then generated by adding noise in the direction of the momentum term using a step size 𝛼. The result is constrained to the range [x - 𝜖, x + 𝜖] to ensure that the noise remains minimally perceptible.
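
The iterative update can be sketched in PyTorch as follows (a simplified, single-image rendition of the update rules above; model, loss_fn, and the default values are assumptions, not the authors' code):

    import torch

    def mi_fgsm(model, loss_fn, x, y_true, eps, T=10, mu=1.0):
        alpha = eps / T               # per-iteration step size
        x = x.clone().detach()        # clean image, pixels assumed in [0, 1]
        x_adv = x.clone()
        g = torch.zeros_like(x)       # momentum accumulator

        for _ in range(T):
            x_adv.requires_grad_(True)
            loss = loss_fn(model(x_adv), y_true)
            grad = torch.autograd.grad(loss, x_adv)[0]

            # Accumulate momentum with the L1-normalized gradient.
            g = mu * g + grad / grad.abs().sum()

            # Step in the sign direction, then project back into the eps-ball.
            x_adv = x_adv.detach() + alpha * g.sign()
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)

        return x_adv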

Problems with Existing Methods

Vanishing Gradient

The vanishing gradient problem describes the phenomenon of the gradient tending towards zero as more iterations occur. Each update then becomes increasingly small, which can cause the iterative process to stall.

This is one factor which can lead to noise curing.

Comparison of the Poincare metric and Cross-entropy for addressing the vanishing gradient problem [1].

Noise Curing

Noise curing is the phenomenon in which the noise loses diversity and adaptability because the historical momentum comes to dominate the update.

The top two lines in the figure below measure noise curing with the cosine similarity of two successive perturbations (modifications) to images, where a lower similarity indicates less noise curing.

The lower two lines indicate the target label probability of each method. In this test, the Poincare attack method proved to be less susceptible to noise curing and provided a higher target label probability.

A comparison of cosine similarity and target label probability over iterations for Poincare attack (Po) and MI-FGSM (CE) [1].

Transferability

In this context, transferability is the property that an adversarial example crafted against one model also fools other models that it was not crafted against.

The model in the paper aims to produce a more transferable attack by not only encouraging classifications closer to the target class, but also by encouraging classifications farther from the true class.

This is done with the introduction of a triplet loss. In the figure below, both models were able to create adversarial examples that would be misclassified as the target class. However, in the case with triplet loss, the adversarial examples landed much further from the true label than in the model without triplet loss.

Increasing the distance from the adversarial examples’ classification to the true label is key in increasing transferability in the case of black-box attacks, where the network architecture is unknown.

A visualization of the transferability of two models of generating adversarial examples, one with and one without triplet loss [1].

Algorithm

In each iteration of the algorithm, the attack performs the following steps to update the adversarial example.

Obtain Classifier Logits

The augmented image is input into the ensemble of classifiers. The output probability for the target class is used to calculate a logit for each classifier.

These logits are used to measure the distance from the target class to the classifier’s predicted class.

Fuse the Logits

The logits obtained in the previous step are fused in a weighted average using the ensemble weights (if the weights are equal, then this is an arithmetic mean).
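
In code, the fusion is just a weighted sum of the per-model logits (a sketch with assumed names, where models is the ensemble and w holds weights that sum to one):

    import torch

    def fused_logits(models, w, x):
        # Weighted average of the logits produced by the K ensemble members.
        return sum(w_k * model(x) for w_k, model in zip(w, models))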

Compute Loss

The model’s loss is calculated using the following formula.

The model’s cost function [1].

The cost function is a combination of a Poincare distance and a triplet loss.

The Poincare distance is measured between the fused logits of the adversarial example and the target class, and is defined as follows:

Formula for the Poincare distance between two points, u and v [1].

The primary motivation to use Poincare distance instead of softmax cross entropy is its ability to preserve the magnitude of the gradient in an iterative attack.

This behavior is possible due to properties of Poincare distance:

  • the magnitude of the gradient only grows if the data point gets closer to the target label
  • the target label is near the surface of the Poincare ball

Using softmax cross-entropy, the magnitude of the gradient would normally tend towards zero as the adversarial example gets closer to the target label over the iterations. With the Poincare distance, by contrast, the metric magnifies distances near the surface of the ball, where the target label lies, so the magnitude of the gradient increases as the example approaches the target.

The resulting behavior is that instead of each iteration becoming less and less effective than the last, momentum is preserved and an improved final result is achieved.
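
For reference, the standard Poincare distance between two points u and v strictly inside the unit ball can be written as follows; this is a generic implementation of the textbook formula, not the authors' code.

    import numpy as np

    def poincare_distance(u, v, eps=1e-6):
        # d(u, v) = arccosh(1 + 2 * ||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2)))
        num = 2.0 * np.sum((u - v) ** 2)
        den = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
        return np.arccosh(1.0 + num / max(den, eps))

As either point approaches the surface of the ball, the denominator shrinks and the distance (and hence the gradient of a loss built from it) grows sharply, which is exactly the property exploited to counteract the vanishing gradient.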

Using the Poincare distance, the triplet loss is made up of the following components:

  • The distance from the adversarial example to the target class.
  • The distance from the adversarial example to the true class.
  • γ, the margin between these two distance metrics.

This combination produces a cost function that encourages both classification near the target label and away from the true label.

Triplet loss formula [1].
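
A hinge-style triplet loss of this form can be sketched as follows; dist is a placeholder for whichever distance the attack uses between the fused logits and a label, and gamma is the margin (the names are assumptions for illustration):

    def triplet_loss(dist, logits, y_target, y_true, gamma):
        # Pull the logits toward the target label and push them at least
        # a margin gamma away from the true label (assumes dist returns floats).
        d_target = dist(logits, y_target)
        d_true = dist(logits, y_true)
        return max(0.0, d_target - d_true + gamma)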

Compute Gradient of Loss

The gradient of the loss in the previous step is calculated.

Update Momentum

The momentum term is updated using the momentum decay factor and the gradient calculated in the previous step.

The process for updating the momentum term [1].

Update the Adversarial Example

The adversarial example is updated by taking a step of size 𝛼 in the direction given by the sign of the updated momentum term, and the result is clipped so that it stays within 𝜖 of the clean image.

The process for updating the adversarial example [1].
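
Putting the previous steps together, one iteration of the attack might look like the sketch below (a simplified PyTorch rendition that reuses the fused-logit and loss pieces sketched earlier; all names are assumptions rather than the authors' code). Because the loss measures distance to the target, the update descends the loss, hence the minus sign in the step.

    import torch

    def attack_step(models, w, loss_fn, x, x_adv, g, y_target, y_true,
                    alpha, eps, mu):
        # x is the clean image (assumed detached, pixels in [0, 1]).
        x_adv = x_adv.clone().detach().requires_grad_(True)

        logits = sum(w_k * m(x_adv) for w_k, m in zip(w, models))  # fuse logits
        loss = loss_fn(logits, y_target, y_true)                   # Poincare + triplet loss
        grad = torch.autograd.grad(loss, x_adv)[0]                 # gradient of the loss

        g = mu * g + grad / grad.abs().sum()       # momentum update
        x_adv = x_adv.detach() - alpha * g.sign()  # sign step toward the target
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
        return x_adv, g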

Results

Below is a comparison of success rates against two state-of-the-art techniques for producing adversarial examples, DI²-FGSM and TI-FGSM. 'Po' and 'Trip' denote the Poincare distance and the triplet loss, respectively.

TI-FGSM convolves the gradients of the images with a predefined kernel to approximate the optimization of a perturbation over an ensemble. This technique is combined with Poincare distance and Triplet loss, designated by ‘TI’.

Each of the adversarial examples were generated with an ensemble of five networks, and then tested on the ensemble network and the hold-out network. The ensemble network is a white-box setting (the architecture of the network is known) and the hold-out network is a black-box setting (the architecture of the network is unknown), hence the lower success rates of the adversarial examples on the hold-out network.

The success rates of targeted adversarial attacks compared against two state-of-the-art methods for generating adversarial examples, DI²-FGSM and TI-FGSM [1].

The table below is similar to the one above, except that the target models are adversarially trained to defend against such attacks.

The success rates of targeted adversarial attacks compared against two state-of-the-art methods for generating adversarial examples, DI²-FGSM and TI-FGSM. These target models are adversarially trained [1].

Perhaps the most interesting of all the results is the perceptibility of the perturbations in the adversarial examples produced by the model.

Adversarial examples generated by the model (top), clean images (middle), and the noise added to each image (bottom) [1].

Above are adversarial examples generated by the model. While the changes are perceptible, they are constrained such that no human observer would reasonably classify the adversarial examples any differently than the clean images.

Reflections

This paper’s key contribution is the carefully crafted loss function for generating adversarial examples which combines Poincare distance and their triplet loss. While the effectiveness of these two techniques is difficult to observe in the generated adversarial examples, their results certainly prove their effectiveness.

Understanding the Poincare ball and Poincare distance is not a simple task for someone who is not acquainted with non-Euclidean geometry. Nevertheless, when boiled down to its core property, that distances between points grow exponentially near the surface of the ball, it makes sense as a solution to the vanishing gradient problem and noise curing.

While the triplet loss function is rather intuitive, it is also an ingenious way to promote classification farther away from the true class.

My main reservation with the paper's contributions is not whether the authors accomplished their goals, but whether they had the right goals in mind. Although they clearly state that the intended use for such an attack is to evaluate deep neural networks and increase their robustness, I can't help but feel that they are simply creating problems for others to solve. It is similar to white-hat hackers, who find vulnerabilities in good faith to aid cybersecurity efforts; those vulnerabilities, however, are often easily fixed, whereas defending against attacks like this one requires designing adversarially trained networks.

References

[1] M. Li, C. Deng, T. Li, J. Yan, X. Gao, and H. Huang, “Towards Transferable Targeted Attack,” 2020. [Online]. Available: https://openaccess.thecvf.com/content_CVPR_2020/papers/Li_Towards_Transferable_Targeted_Attack_CVPR_2020_paper.pdf.

[2] Y. Dong, F. Liao, T. Pang, H. Su, J. Zhu, X. Hu, and J. Li, “Boosting Adversarial Attacks with Momentum,” 2018. [Online]. Available: https://openaccess.thecvf.com/content_cvpr_2018/papers/Dong_Boosting_Adversarial_Attacks_CVPR_2018_paper.pdf.
