In the previous part, we understood what an adversarial attack is and how attacks can be classified based on various attributes. (In case you missed it, you can read it here.)
In this part, we will overview some of the most common attacks on image classifiers and implement a very popular one, the Fast Gradient Sign Method (FGSM) proposed by Goodfellow et al. We will see how it works and how it challenges the notion that the non-linearity of neural networks is the reason behind the success of adversarial attacks.
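The L-BFGS attack was one of the earliest attacks: Szegedy et al. first discovered the vulnerability of deep visual models to adversarial perturbations by solving the following optimization problem:

$$\min_{\rho} \|\rho\|_2 \quad \text{s.t.} \quad \mathcal{C}(\mathbf{I}_c + \rho) = \ell, \qquad \mathbf{I}_c + \rho \in [0, 1]^m$$

where ρ is the adversarial signal (the perturbation), I_c is the clean image with m pixel values, C(·) is the label predicted by the classifier and ℓ is the target label. In other words, we look for the perturbation with the smallest second norm (ℓ2 norm) that makes the classifier output ℓ while the perturbed image remains a valid image. This is the same as the equation described in part 1, with norm value p = 2.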
Szegedy et al. computed an approximate solution to this problem with the Limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) algorithm, after which the attack is named. However, solving this optimization problem for a large number of examples is computationally prohibitive. That limitation is addressed by the next method.
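As we understand the formulation in Szegedy et al., the hard label constraint is not solved directly. It is replaced by a penalty on the classification loss, which gives a box-constrained problem that L-BFGS can handle, and a line search over c > 0 finds the smallest c whose minimizer is classified as the target label ℓ. Here I_c is the clean image with m pixel values, ρ the perturbation, and L(·, ·) stands for the classifier's loss (cross-entropy, for example); the exact choice of loss is an assumption on our part:

```latex
\min_{\rho} \; c\,\|\rho\|_2 + \mathcal{L}(\mathbf{I}_c + \rho, \ell)
\quad \text{s.t.} \quad \mathbf{I}_c + \rho \in [0, 1]^m
```

Each evaluation of this objective needs a forward and backward pass, and the line search repeats the whole optimization for several values of c, which is why the attack becomes expensive for many images.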
FGSM is among the most influential attacks in the existing literature, especially in the white-box setting. Its core idea, performing gradient ascent on the model's loss surface to fool it, is the basis for a plethora of adversarial attacks, and many follow-up attacks can be traced back to it. The most widely circulated image of an adversarial attack, a panda that is misclassified as a gibbon after an imperceptible perturbation is added, was produced with FGSM.
FGSM is a one-step, gradient-based method that computes norm-bounded perturbations, prioritizing the efficiency of the perturbation computation over achieving high fooling rates. Goodfellow et al. also used this attack to support their linearity hypothesis, which holds that the linear behavior of modern neural networks in high-dimensional spaces (induced, for example, by ReLUs) is sufficient to make them vulnerable to adversarial perturbations. At the time, the linearity hypothesis was in sharp contrast to the prevailing idea that adversarial vulnerability was a result of the high non-linearity of complex modern networks.
Goodfellow et al. claimed that adversarial examples expose fundamental blind spots in our training algorithms. They also claimed that linear behavior in high-dimensional spaces is sufficient to cause adversarial examples. Building on this, they designed a fast method of generating adversarial examples that makes adversarial training practical.
Linear explanation of adversarial examples
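We start by explaining the existence of adversarial examples for linear models. The precision of input features is limited; digital images, for example, often use 8 bits per pixel, so any change smaller than 1/255 of the dynamic range is discarded. It would therefore be unreasonable for a classifier to respond differently to an input x than to an adversarial input x̃ = x + η when every element of the perturbation η is smaller than the precision of the features. We will see that a linear classifier can nevertheless be forced to do exactly that.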
Adversarial examples for linear models
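The argument of Goodfellow et al. can be written out as follows. For a linear model with weight vector w, the perturbation changes the activation by wᵀη. Under the max-norm constraint ‖η‖∞ ≤ ϵ, this change is maximized by η = ϵ sign(w). If w has n dimensions and the average magnitude of its elements is m, the activation grows by ϵmn, while ‖η‖∞ stays fixed at ϵ no matter how large n is:

```latex
\begin{aligned}
w^\top \tilde{x} &= w^\top x + w^\top \eta, \qquad \|\eta\|_\infty \le \epsilon \\
\eta &= \epsilon \, \operatorname{sign}(w) \\
w^\top \eta &= \epsilon \sum_{i=1}^{n} |w_i| = \epsilon \, \|w\|_1 = \epsilon \, m \, n
\end{aligned}
```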
Thus, for high-dimensional problems, we can make many infinitesimal changes to the input that add up to one large change in the output.
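To make this concrete, here is a small pure-Python sketch (our own illustration, not code from the paper). It draws a hypothetical random weight vector of dimension n, applies the max-norm perturbation η = ϵ sign(w), and reports both the largest per-feature change and the resulting shift in the linear activation wᵀη:

```python
import random


def activation_shift(n, eps=0.01, seed=0):
    """Return (w·η, max_i |η_i|) for η = eps * sign(w) and a random n-dimensional w."""
    rng = random.Random(seed)
    # Hypothetical weights, uniform in [-1, 1]: the average |w_i| is about 0.5.
    w = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    # Max-norm perturbation aligned with the weights: every element has magnitude eps.
    eta = [eps if wi > 0 else -eps for wi in w]
    shift = sum(wi * ei for wi, ei in zip(w, eta))
    return shift, max(abs(e) for e in eta)


for n in (10, 1_000, 100_000):
    shift, max_change = activation_shift(n)
    print(f"n={n:>7}: largest per-feature change={max_change}, activation shift={shift:.2f}")
```

The largest per-feature change stays at ϵ = 0.01 for every n, while the activation shift grows roughly in proportion to n (about 0.005 · n for these weights).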
Linear Perturbation of Non-Linear Models
The linear view of adversarial examples suggests a fast way of generating them. We hypothesize that neural networks are too linear to resist linear adversarial perturbation. LSTMs, ReLUs, and maxout networks are all intentionally designed to behave in very linear ways, so that they are easier to optimize. More nonlinear models such as sigmoid networks are carefully tuned to spend most of their time in the non-saturating, more linear regime for the same reason. This linear behavior suggests that cheap, analytical perturbations of a linear model should also damage neural networks. Let us see how adversarial examples can be generated for the neural networks.
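Let θ be the parameters of a model, x the input to the model, y the target associated with x, and J(θ, x, y) the cost used to train the network. Linearizing the cost function around the current input (with θ fixed) yields the optimal max-norm constrained perturbation

$$\eta = \epsilon \, \operatorname{sign}\left(\nabla_x J(\theta, x, y)\right)$$

We refer to this as the “fast gradient sign method” of generating adversarial examples. Note that the required gradient can be computed efficiently using backpropagation.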
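Before turning to TensorFlow, a minimal pure-Python sketch may help show what the fast gradient sign method η = ϵ sign(∇x J(θ, x, y)) computes. It assumes a binary logistic-regression model (a stand-in we chose because its input gradient has the closed form (p − y)·w); the weights, bias and input below are hypothetical values, not taken from the paper:

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, b, x):
    """P(y = 1 | x) for a logistic-regression model with weights w and bias b."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def loss_and_input_grad(w, b, x, y):
    """Binary cross-entropy J(θ, x, y) and its gradient with respect to the input x."""
    p = predict_proba(w, b, x)
    loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    # Chain rule through the sigmoid: dJ/dx_i = (p - y) * w_i.
    grad_x = [(p - y) * wi for wi in w]
    return loss, grad_x


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(w, b, x, y, eps):
    """One FGSM step: x + eps * sign(grad_x J(θ, x, y))."""
    _, grad_x = loss_and_input_grad(w, b, x, y)
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad_x)]


# Hypothetical "trained" model and a clean input whose true label is 1.
w, b = [0.5, -1.2, 0.8, 0.3], 0.1
x, y = [1.0, -0.5, 0.7, 0.2], 1

x_adv = fgsm(w, b, x, y, eps=0.7)
for name, inp in (("clean", x), ("adversarial", x_adv)):
    loss, _ = loss_and_input_grad(w, b, inp, y)
    print(f"{name:>11}: P(y=1)={predict_proba(w, b, inp):.3f}, loss={loss:.3f}")
```

Every input feature moves by exactly ϵ, yet the clean input (P(y=1) of about 0.86) is pushed below 0.5. A deep network has no closed-form input gradient, which is why the TensorFlow code below obtains it with automatic differentiation instead.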
Code Implementation for FGSM
Let us see how we can implement this in code. We will use code from the FGSM tutorial in TensorFlow's official documentation (https://www.tensorflow.org/tutorials/generative/adversarial_fgsm). We will analyze the function create_adversarial_pattern, as it implements the crux of the paper, i.e. it calculates the gradient sign.
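```python
import tensorflow as tf

# Set up earlier in the TensorFlow tutorial: an ImageNet-pretrained MobileNetV2
# classifier and a categorical cross-entropy loss (labels are one-hot encoded).
pretrained_model = tf.keras.applications.MobileNetV2(include_top=True, weights='imagenet')
loss_object = tf.keras.losses.CategoricalCrossentropy()

def create_adversarial_pattern(input_image, input_label):
    with tf.GradientTape() as tape:
        tape.watch(input_image)
        prediction = pretrained_model(input_image)
        loss = loss_object(input_label, prediction)

    # Get the gradients of the loss w.r.t. the input image.
    gradient = tape.gradient(loss, input_image)
    # Get the sign of the gradients to create the perturbation.
    signed_grad = tf.sign(gradient)
    return signed_grad

# Code URL: https://www.tensorflow.org/tutorials/generative/adversarial_fgsm
```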
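The following lines ask TensorFlow to record the computations involving input_image. The explicit tape.watch call is needed because a GradientTape automatically tracks only trainable variables, and input_image is a plain tensor.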
```python
with tf.GradientTape() as tape:
    tape.watch(input_image)
```
The next two lines run the model on input_image and compute the loss of that prediction against input_label.
```python
prediction = pretrained_model(input_image)
loss = loss_object(input_label, prediction)
```
The next line computes the gradient of the loss with respect to the input image. This provides the ‘gradient’ term in FGSM (Fast Gradient Sign Method).
```python
gradient = tape.gradient(loss, input_image)
```
Once we have the gradients, we apply the sign function to them, i.e. sign(gradient), which gives the ‘gradient sign’ term in FGSM (Fast Gradient Sign Method). In the tutorial, the adversarial perturbation (η) is calculated for a photo of a Labrador retriever.
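Because the adversarial perturbation (η) calculated by FGSM is the output of tf.sign, each of its elements is −1, 0 or +1; its magnitude is controlled entirely by the ϵ it is later multiplied by.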
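The adversarial image that fools the model is calculated with the code below. The tutorial also clips the result back to [−1, 1], the input range expected by MobileNetV2:
```python
adv_x = image + eps * perturbations
# Keep pixel values in the model's valid input range.
adv_x = tf.clip_by_value(adv_x, -1, 1)
```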
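The tutorial then generates adversarial images for several values of ϵ: the larger ϵ is, the more visible the perturbation becomes and the more likely the model is to misclassify the image.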
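The effect of ϵ can be reproduced without TensorFlow in a toy sketch. Here the hypothetical stand-ins are a four-feature logistic-regression "model" and a four-value "image" already scaled to [−1, 1] (neither comes from the tutorial). The perturbation is fixed once, as in the tutorial, and only ϵ varies; the two-line update mirrors adv_x = image + eps * perturbations followed by clipping:

```python
import math

# Hypothetical toy stand-ins for the tutorial's model and image (not MobileNetV2).
W = [0.5, -1.2, 0.8, 0.3]
B = 0.1
image = [0.9, -0.5, 0.7, 0.2]   # values already scaled to [-1, 1]


def prob_true_class(x):
    """P(true class = 1 | x) under the logistic-regression stand-in."""
    z = sum(wi * xi for wi, xi in zip(W, x)) + B
    return 1.0 / (1.0 + math.exp(-z))


# For logistic regression with true label 1, grad_x J = (p - 1) * W with p < 1,
# so sign(grad_x J) = -sign(W). This plays the role of `perturbations`.
perturbations = [-1.0 if wi > 0 else 1.0 for wi in W]


def make_adversarial(x, eps):
    """adv_x = clip(x + eps * perturbations, -1, 1)."""
    return [min(1.0, max(-1.0, xi + eps * pi)) for xi, pi in zip(x, perturbations)]


for eps in (0.0, 0.01, 0.1, 0.15, 0.5, 0.8):
    adv = make_adversarial(image, eps)
    print(f"eps={eps:<5} P(true class)={prob_true_class(adv):.3f}")
```

The confidence in the true class falls steadily as ϵ grows, and the clipping step guarantees that even a very large ϵ still produces values inside the valid range.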
Goodfellow et al. drew the following conclusions from these experiments:
- Adversarial examples can be explained as a property of high-dimensional dot products. They are a result of models being too linear, rather than too nonlinear.
- The generalization of adversarial examples across different models can be explained as a result of adversarial perturbations being highly aligned with the weight vectors of a model, and different models learning similar functions when trained to perform the same task.
- The direction of perturbation, rather than the specific point in space, matters most. Space is not full of pockets of adversarial examples that finely tile the reals like the rational numbers.
- Because it is the direction that matters most, adversarial perturbations generalize across different clean examples.
Several other attacks build on FGSM or follow a similar methodology, including the Basic Iterative Method (BIM), the Projected Gradient Descent (PGD) attack, the DeepFool attack and the Carlini & Wagner (C&W) attack. Two other families of attacks depart from it in different ways:
- JSMA and One-pixel Attack: Whereas most early attacks perturb a clean image holistically and limit the perturbation by minimizing its second norm (ℓ2) or infinity norm (ℓ∞), the Jacobian-based Saliency Map Attack (JSMA) and the One-pixel attack deviate from this practice by restricting the perturbation to small regions of the image.
- Universal Adversarial Perturbations: The methods above compute adversarial perturbations that fool a target model on a specific image. Moosavi-Dezfooli et al. instead focused on computing image-agnostic perturbations that fool the model on any image with high probability.
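The conclusions that the direction of a perturbation matters most, and that perturbations therefore generalize across clean examples, can be illustrated on a linear model. In this sketch (our own toy example with a hypothetical random classifier, not Moosavi-Dezfooli et al.'s algorithm), a single perturbation built from the weights alone is added to many different inputs. For a linear model it shifts every logit by exactly the same amount, −ϵ‖w‖₁, so it flips most inputs predicted as class 1; deep networks only approximate this behavior:

```python
import random

# Hypothetical linear classifier: predicts class 1 when w·x + b > 0.
rng = random.Random(42)
n = 1000
w = [rng.gauss(0.0, 1.0) for _ in range(n)]
b = 0.0


def logit(x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b


# One fixed perturbation, chosen from the weights only (no particular input):
# eta = -eps * sign(w) pushes every input toward class 0.
eps = 0.05
eta = [-eps if wi > 0 else eps for wi in w]

# Many different "clean" inputs.
clean = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(200)]
positives = [x for x in clean if logit(x) > 0]
fooled = sum(logit([xi + ei for xi, ei in zip(x, eta)]) <= 0 for x in positives)
print(f"{fooled}/{len(positives)} class-1 inputs flipped by the same eta")
```

Only the few inputs whose logit is already far above the shift survive; raising eps flips those too.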
In this part, we studied the FGSM attack in detail and walked through its implementation in TensorFlow. We also overviewed some other attacks on image classifiers. In the next parts, we will move beyond classification and see how adversarial attacks can be performed on other tasks, such as face recognition, object detection and object tracking, and how they can affect the real world in several ways.
Vishal is a Data Scientist with over two years of experience in the field. He holds an M.Tech in Electrical Engineering from IIT Gandhinagar, where he specialized in Deep Learning and presented his research on the classification of craters with ML using Chandrayaan-1 data at IPSC 2020. He has worked on a diverse set of industry problems, ranging from computer vision (object detection, segmentation, image classification and generation, OCR) to NLP (document classification, Named Entity Recognition, and Question Answering). Currently, he is working as a Data Scientist at Subex AI Labs, where he applies his expertise to various deep learning use cases.