We’ve developed a method to assess whether a neural network classifier can reliably defend against adversarial attacks not seen during training. Our method yields a new metric, UAR (Unforeseen Attack Robustness), which evaluates the robustness of a single model against an unanticipated attack, and highlights the need to measure performance across a more diverse range of unforeseen attacks.
Modern neural networks have achieved high accuracies on a wide range of benchmark tasks. However, they remain susceptible to adversarial examples, small but carefully crafted distortions of inputs created by adversaries to fool the networks. For example, the adversarial example with L∞ distortion below differs from the original image by at most 32 in each RGB pixel value; a human can still classify the changed image, but it is confidently misclassified by a standard neural network.
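To make the L∞ constraint concrete, here is a minimal NumPy sketch (array shapes, pixel range, and the function name are illustrative, not part of our released code) that projects a distorted image back into the L∞ ball of radius 32 around the original:

```python
import numpy as np

def project_linf(x_adv, x_orig, eps=32):
    """Project a distorted image back into the L-infinity ball of radius eps
    around the original image (pixel values assumed to lie in 0-255)."""
    # Each pixel may differ from the original by at most eps in either direction.
    x_proj = np.clip(x_adv, x_orig - eps, x_orig + eps)
    # Keep pixel values in the valid image range.
    return np.clip(x_proj, 0, 255)

# Example: a random distortion clipped to the eps=32 budget described above.
x = np.random.randint(0, 256, size=(224, 224, 3)).astype(np.int64)
x_adv = x + np.random.randint(-64, 65, size=x.shape)
x_adv = project_linf(x_adv, x, eps=32)
assert np.abs(x_adv - x).max() <= 32
```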
Method principles
We’ve created a three-step method to assess how well a model performs against a new held-out type of distortion. Our method evaluates against diverse unforeseen attacks at a wide range of distortion sizes and compares the results to a strong defense which has knowledge of the distortion type. It also yields a new metric, UAR, which assesses the adversarial robustness of models against unforeseen distortion types.
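The sketch below illustrates the idea behind such a normalized metric: a model's adversarial accuracy against an attack, aggregated over calibrated distortion sizes, is compared to that of a defense adversarially trained against the same attack. The accuracy numbers here are hypothetical, and the exact calibrated sizes and normalization are provided in our code package rather than defined by this snippet.

```python
def uar(model_accuracies, adv_trained_accuracies):
    """Illustrative sketch of a UAR-style score.

    model_accuracies / adv_trained_accuracies: adversarial accuracies (in
    percent) at the same calibrated distortion sizes, for the evaluated
    model and for an adversarially trained reference defense respectively.
    """
    return 100.0 * sum(model_accuracies) / sum(adv_trained_accuracies)

# Hypothetical numbers: a model evaluated against an unforeseen attack at six
# calibrated distortion sizes, versus the adversarially trained reference.
print(uar([85.0, 70.0, 52.0, 30.0, 12.0, 4.0],
          [88.0, 80.0, 71.0, 60.0, 45.0, 30.0]))
```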
1. Evaluate against diverse unforeseen distortion types
Typical papers on adversarial defense evaluate only against the widely studied L∞ or L2 distortion types. However, we show that evaluating against different Lp distortion types gives very similar information about adversarial robustness: performance against one Lp attack is largely predictive of performance against another.[2]
The Lp distortion allows an image, viewed as a vector x of pixel values, to be distorted to a vector x′ such that x and x′ are close in the Lp norm.
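Concretely, the attack must keep ‖x − x′‖_p ≤ ε for some distortion budget ε. A minimal NumPy sketch of this distance check (array shapes and budgets are illustrative only):

```python
import numpy as np

def lp_distance(x, x_prime, p):
    """L_p distance between two images flattened into pixel vectors."""
    diff = (x_prime.astype(np.float64) - x.astype(np.float64)).ravel()
    if np.isinf(p):
        return np.abs(diff).max()
    return (np.abs(diff) ** p).sum() ** (1.0 / p)

# A valid L_p attack keeps the distorted image within distance eps of the original:
#     ||x - x_prime||_p <= eps
x = np.random.randint(0, 256, size=(32, 32, 3))
x_prime = np.clip(x + np.random.randint(-8, 9, size=x.shape), 0, 255)
print(lp_distance(x, x_prime, p=2), lp_distance(x, x_prime, p=np.inf))
```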
We conclude that evaluating against Lp distortions is insufficient to predict adversarial robustness against other distortion types. Instead, we suggest that researchers evaluate models against adversarial distortions that are not similar to those used in training. We offer the L1, L2-JPEG, Elastic, and Fog attacks as a starting point. We provide implementations, pre-trained models, and calibrations for a variety of attacks in our code package.
2. Choose a wide range of distortion sizes calibrated against strong models
We found that considering too narrow a range of distortion sizes can reverse qualitative conclusions about adversarial robustness. To pick a range, we examine images produced by an attack at different distortion sizes and choose the largest range for which the images are still human-recognizable. However, as shown below, an attack with a large distortion budget uses its full budget only against strong defenses. We therefore recommend calibrating the range of distortion sizes by evaluating against adversarially trained models (our code package also provides calibrated sizes for a wide variety of attacks).
Sample images (espresso maker) of the same strong attack applied to different defense models. Attacking stronger defenses causes greater visual distortion.
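One plausible way to operationalize this calibration, sketched below with hypothetical distortion sizes, accuracy values, and threshold (our code package contains the actual calibrations), is to keep the candidate sizes at which an adversarially trained defense still retains non-trivial adversarial accuracy:

```python
def calibrate_sizes(candidate_eps, adv_trained_accuracy, min_accuracy=5.0):
    """Illustrative calibration sketch (threshold is hypothetical): keep
    candidate distortion sizes, from smallest to largest, for which an
    adversarially trained defense still achieves at least min_accuracy
    percent adversarial accuracy. adv_trained_accuracy maps eps -> percent."""
    sizes = []
    for eps in sorted(candidate_eps):
        if adv_trained_accuracy(eps) >= min_accuracy:
            sizes.append(eps)
        else:
            break
    return sizes

# Hypothetical accuracy curve for an adversarially trained model.
accuracy_curve = {1: 90.0, 2: 82.0, 4: 65.0, 8: 40.0, 16: 12.0, 32: 2.0}
print(calibrate_sizes(accuracy_curve.keys(), accuracy_curve.get))  # [1, 2, 4, 8, 16]
```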