Is your NLP model able to withstand an adversarial attack?

Adversarial Attack

Edward Ma
Published in HackerNoon.com
Jun 30, 2019 · 3 min read


An adversarial attack is a way to fool a model with abnormal input. Szegedy et al. (2013) introduced it in the computer vision field. Given a set of normal pictures, a strong image classification model classifies them correctly. However, the same model can no longer classify the inputs once a carefully crafted (not random) noise is added.

Left: original input. Middle: difference between left and right. Right: adversarial input. The image classification model classifies all three left inputs correctly, but classifies all three right inputs as “ostrich”. (Szegedy et al., 2013)

In the natural language processing (NLP) field, we can also generate adversarial examples to see how resistant your NLP model is to adversarial attack. Pruthi et al. use character-level errors to simulate the attack, and report a 32% relative (and 3.3% absolute) error reduction for the defended state-of-the-art model.
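As a rough illustration of what such character-level errors look like, the sketch below injects one random edit (swap, drop, insert, or substitute) into the interior of each word. The function name and the exact edit operations are my own choices for illustration, not the procedure from Pruthi et al.

```python
import random

def perturb_word(word, rng=random.Random(0)):
    """Apply one random character-level edit to simulate a
    misspelling-style adversarial attack (illustrative sketch only)."""
    if len(word) < 4:
        return word  # leave very short words untouched
    i = rng.randrange(1, len(word) - 2)  # keep first and last characters intact
    op = rng.choice(["swap", "drop", "insert", "substitute"])
    if op == "swap":
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if op == "drop":
        return word[:i] + word[i + 1:]
    if op == "insert":
        return word[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz") + word[i:]
    return word[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz") + word[i + 1:]

sentence = "the quick brown fox jumps over the lazy dog"
print(" ".join(perturb_word(w) for w in sentence.split()))
```

Feeding such perturbed sentences into a downstream classifier shows how quickly accuracy degrades under even small character-level noise.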

Architecture

Pruthi et al. use a semi-character based RNN (ScRNN) architecture to build a word recognition model. A sequence of words is fed into the RNN, but the model does not consume each word as a whole; instead, each word is split into a prefix, a body, and a suffix (see the sketch after this list).

  • Prefix: First character
  • Body: Second character to second last character
  • Suffix: Last character
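To make the decomposition concrete, here is a minimal sketch of how such an input representation could be built for a single word: a one-hot vector for the first character, an order-insensitive bag-of-characters count for the middle, and a one-hot vector for the last character. The function name, alphabet, and handling of short or unknown characters are my own assumptions, not the paper's code.

```python
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
INDEX = {c: i for i, c in enumerate(ALPHABET)}

def one_hot(ch):
    """One-hot vector over the alphabet; all zeros for unknown characters."""
    vec = [0] * len(ALPHABET)
    if ch in INDEX:
        vec[INDEX[ch]] = 1
    return vec

def bag_of_chars(chars):
    """Order-insensitive character counts for the word's body."""
    counts = Counter(c for c in chars if c in INDEX)
    return [counts.get(c, 0) for c in ALPHABET]

def semi_character_features(word):
    """Prefix / body / suffix features for one word (ScRNN-style sketch)."""
    word = word.lower()
    prefix = one_hot(word[0])                                        # first character
    body = bag_of_chars(word[1:-1]) if len(word) > 2 else [0] * len(ALPHABET)
    suffix = one_hot(word[-1]) if len(word) > 1 else [0] * len(ALPHABET)
    return prefix + body + suffix                                    # concatenated vector fed to the RNN

print(semi_character_features("adversarial"))
```

Because the body is a bag of characters, scrambling the interior letters of a word ("adversarial" vs. "advesrarial") produces the same feature vector, which is what makes the word recognition model robust to this class of misspellings.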


Edward Ma
HackerNoon.com

Focused on Natural Language Processing and Data Science Platform Architecture. https://makcedward.github.io/