Many Deep Nets Are More Similar Than You Might Think

(And That’s Why Transfer Attacks Are Easy)

Nick Lord
Five Blog
22 min read · Oct 10, 2022


In a previous post, we saw that a deep network’s adversarial examples¹ just represent the input features that the net uses to perform its task. In this post, we’re going to take that viewpoint further. We’re going to see how easily we can transfer adversarial examples between different networks, and so find out how similar those different networks’ features are. The answer? Pretty easily, and pretty similar.

We’ll develop a simple adversarial attack based on taking features from one net and using them to extract predictable responses from a second one. “Transfer” (or “surrogate-based”) attacks of this type already exist. But here, we’ll be demonstrating one that’s very simple and works very well. It all comes down to (a) beginning with the assumption that different deep nets are actually learning the same features, and (b) knowing what to do with that fact. When our attack succeeds, we’ll know that our assumption was pretty good.

If two classifiers have near-coincident decision boundaries, then they’re similar in the way that counts most.

Note that this transfer of features between different networks represents a crucial, fundamental similarity between those nets. After all, it means that they’ve learned similar functions at and around their input images, and that’s what matters most. The converse is true as well: if those two different nets are actually implementing similar functions, then an adversarial example of one net should produce a similar response on the other. We’ll be making use of that fact.

(If the relationship between adversarial examples and features isn’t already clear, read the previous post and then come back! Also, if you want a 5-minute video summary of this post, we’ve got one right here.)

Transfer Attacks: A Primer

Transfer attacks, where adversarial perturbations are calculated on one net and then applied to another one, aren’t new at all. The first paper to point out the existence of adversarial examples in computer vision included a section in which the authors tried feeding examples derived from one net into another, to see whether the second net would also be fooled. This is probably the simplest transfer attack one could imagine, and while it didn’t work perfectly, it worked much better than it would have had the nets not had a lot in common. We would never expect something like that to work were there not a deep connection between the two networks’ behaviours. (Nowadays, we can talk about this underlying assumption in terms of the two networks sharing feature responses, but this was less clearly understood at the time.)

As people studied adversarial examples more, they became interested in transfer attacks in the “black-box” setting. “Black-box” means that none of the net’s internals are visible: you put an image in, and you get a classification out. That might just be a predicted class ID (“decision”), or it might be the confidence(s) of one or more classes (“score”). Either way, you can’t see the net’s parameters or architecture, and so you can’t calculate the gradients that are used in “white-box” attacks. (If it’s not immediately clear why gradients are so key to adversarial attacks, then, again, check out the previous post.) You have to come up with another way of estimating the network’s behaviour near its inputs, so you can work out which way to go when evolving your adversarial example.

Top: The typical situation of a classifier taking a “clean” image of a cat and correctly outputting the cat label. In this situation, we cannot see any of the net internals, but we see the scores assigned to each of the output classes. The highest score has correctly been awarded to “cat”. Bottom: Given the same level of access to the network, our goal as the attacker is to replace the original cat image with a very similar image that changes the classification: in this example, the maximum score now goes to “dog”.

If you’re the attacker in that situation, then one of your problems is that estimating gradient information is really expensive. That white-box gradient that PyTorch will serve up with a single pair of forward/backward calls has D entries in it, where D is the dimension of the input space. For images, that’s high. To estimate the whole thing numerically from the black-box output, you pay the O(D) cost of doing it entry by entry, which just isn’t practical. Without a prior to guide you, you’ll be waiting a very long time just to get the gradient at a single point.
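For concreteness, here’s roughly what that single forward/backward pair looks like in PyTorch. This is just a minimal sketch: the choice of model, input size, and the scalar being differentiated are stand-ins, not anything specific to the attacks discussed here.

```python
import torch
import torchvision.models as models

# A stand-in white-box classifier and input image (names and sizes are illustrative).
model = models.resnet50(weights="IMAGENET1K_V1").eval()
image = torch.rand(1, 3, 224, 224, requires_grad=True)

scores = model(image)                    # one forward pass
scores[0, scores.argmax()].backward()    # one backward pass, here on the top class's score

grad = image.grad                        # same shape as the input: one entry per input dimension
print(grad.numel())                      # 150528 -- the D you'd otherwise have to estimate query by query
```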

So people began looking into what prior information they could incorporate, and how. And they realised that a great source of prior information on the gradients of a network that you don’t have access to is the gradients of another network that you do have access to. So long, that is, as you have reason to believe that those networks might be similar to one another. Then, you can use the one you have (the “surrogate”) to make informative proposals about the one you don’t (the “victim” or “target”)². This is like having someone point to where you should look, instead of leaving you to grope around in the dark yourself.

The framework for solving the problem of how to attack a black-box network (at top): use another network, the surrogate (at bottom), that’s fully known to us. If we want to use gradient descent, we can simply compute gradients on the surrogate. The ultimate question is how well the surrogate represents the black-box target. We’ll show that the answer is “very well, for many networks”.

But this all rests on quite an assumption: that we can somehow get our hands on a network similar to the target, the one we’re actually interested in analysing. If the target net is unknown to us, then how can it somehow also be known to us?

Let’s now take a little time to talk about how people think about differences between networks, and which of those differences do and don’t matter in principle and practice. Then, we can come back and put together our simple attack and find out how well our big assumption actually works.

How Nets Are Different, and How They’re Not (And What That Even Means)

It’s easy to assume that different deep nets are… well, different. That’s true even if we’re just talking about, say, classification CNNs trained on ImageNet using supervised learning. There’ve been a lot of architectural developments since AlexNet, and the general premise of each one was that it was offering something new (and in some sense better) than its predecessors.

Over time, architectures have mutated as layers have been expanded, rearranged, and reconnected. Training datasets have grown through expanded collection and augmentation. Optimisers have seen tweaks to their parameters and to themselves, with the fittest specimens surviving. And, most importantly, the resulting networks have successively produced state-of-the-art accuracies on the test sets on which we’ve all agreed to compare them. Particularly ImageNet. Even if we were to concern ourselves only with the final output, i.e. the function that actually gets learned, we’d have to say that something has been changing as the field has gone through this process of development.

The progress of state-of-the-art top-1 accuracy on ImageNet over the past decade. Taken from here.

But here’s a serious question: how much was actually changing through all this? How different from one another are the functions those models are implementing, really?

Take, for example, the simple matter of models getting bigger. We’ve known for a while now that classical notions of network size/capacity³ just don’t tell us much when it comes to the actual expressivity of a net and how we should expect it to generalise from its training data to its test deployment. In many respects, even “small” nets are already “way too big” for some of the problems they’re trained on: they’re perfectly capable of memorising the entire training set even if it’s nothing but noise! And we also know that a bigger teacher, once it’s learned what it’s learned, will typically find it possible to teach that lesson to a much smaller student: this goes by the name of “distillation”. Whether deep nets even “need to be deep” depends somewhat on what you mean by that. Optimisation is one thing, and representation is another.

On top of this, we also know that different nets empirically tend to make the same sorts of mistakes, including the same sorts of funny mistakes. In observing their behaviour on natural examples, different CNNs have been seen to exhibit a common tendency to focus on texture rather than higher-level shape cues, and classes that a given net finds “hard” or “easy” (i.e. gets lower or higher accuracy on) tend to be regarded the same way by other nets.

A nice demonstration of how insensitive a vanilla VGG-16 is to high-level structural information: the scrambled sample images are still classified with high accuracy on the basis of texture. From the paper “Approximating CNNs with Bag-of-Local-Features Models Works Surprisingly Well on ImageNet”, by Brendel and Bethge (Fig. 5): highly recommended reading.

Adversarial vulnerability gives us another perspective on this. The very fact that the phenomenon exists shows us that a network that gives a high accuracy reading on a given test set is only a tiny set of perturbations away from giving an arbitrarily low one. That is, the difference between higher- and lower-scoring nets isn’t down to a fundamentally deeper understanding of their inputs, but to differences in the way they split statistical hairs. And as we hinted above, the very idea that transfer attacks might work at all rests on an underlying assumption that something pretty similar is going on under the hood. (It would certainly be difficult to fashion a universal adversarial perturbation were that not the case, and yet, it’s very easy.) There is convincing evidence in favour of a “universality hypothesis” stating that different deep nets will converge to learning the same low-level features. And if it turned out that those nets were associating those shared low-level features directly with their output classes, then that would make them pretty similar indeed.

Just another ordinary day of recognising frogs. That’s what happens when you associate a low-level feature directly with class identity.

Walking the Walk (Or At Least Putting Our Shoes On)

We’re saying some arguably provocative things here. It’s time we started backing them up. We don’t want to speculate that different nets are similar: we want to demonstrate it. What’s standing in our way?

Well, as we’ve said, we’re looking to compare functions to one another. Right off the bat, we run into the fact that there’s no one “correct” way of doing that. There are multiple ways of measuring the distances between points⁴, never mind functions, and which one is “best” depends on what we intend to capture. On top of that, the classification functions we’re talking about map very high-dimensional image space (which is for most intents and purposes continuous, albeit bounded) to categorical distributions which might themselves represent a large number of categories. Take a standard implementation of an ImageNet Inception-v3: that’s a function that maps ℝ^(299×299×3) to a 1000-D categorical distribution. How are we supposed to make meaningful comparisons between these sorts of beasts?

But we run into this issue all the time when we do supervised learning in the first place, don’t we? In principle (more on that shortly), we have an inexhaustible space of images, along with an oracle labeller of those images, and we’re trying to find the classification function that behaves as much like that oracle labeller as possible, over that entire space. But we have to define what exactly that’s supposed to mean, in a way that we can actually compute.

The field has a standard answer to that question: the expected KL-divergence between the oracle’s distribution and the classifier’s. In the supervised learning context, we commonly see it appear in its simplified “log loss” form⁵. But in principle, it’s still the KL-divergence that’s being optimised, and if we were looking to try to measure the difference between two arbitrary distributions, we’d need the full expression. For two categorical distributions p and q representing the probabilities of different labels c (from a label set C) at a single given input xᵢ, that expression would look like this:
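$$ D_{\mathrm{KL}}\!\left(p \,\|\, q\right) \;=\; \sum_{c \in C} p(c \mid x_i)\,\log\frac{p(c \mid x_i)}{q(c \mid x_i)} $$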

Again, that’s just to compare the output distributions p and q from two different classifiers at a single input point. What we actually want is the expectation of that quantity over all of the inputs we’ll ever see. Now, do we actually integrate that over all of that extremely high-dimensional image space? Heavens, no. In supervised learning, we typically just add it up over a finite list of samples: the images in the training set. It’s an approximation, but a workable one that seems to serve us pretty well overall.

This is the point at which we want to start picking at this framework a bit. We’re going to think more about that loss function (the KL-divergence), and the pseudo-integration of it (just adding it up over the sample points). And we’re going to do it from the perspective of someone who wants a practical way of understanding how alike two classification functions are.

The KL-divergence is useful in the context of trying to optimise one distribution to match another observed distribution: it is non-negative everywhere, and zero if and only if the two distributions are identical. So as the criterion for an optimiser in that situation, it works well: we just want it to go down at all of the points where we’re evaluating it.

But if two given distributions aren’t the same, does it give us a useful reading of “how different” they are? Well… not really. For one, it isn’t even a distance metric: it isn’t symmetric, and it doesn’t satisfy the triangle inequality. So function F isn’t the same “distance” away from function G as function G is from function F. And even if F is close to G and G is close to H, F might be far from H! Now, you can work around some of this by choosing measures designed to satisfy criteria like these, but that’s actually not the main point. This isn’t really about faulting the KL-divergence as a choice of measure, it’s more down to the fact that comparing distributions is inherently complex and dependent on intent. There’s a subjectivity that you can only resolve by knowing what you’re trying to do.
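To make the asymmetry concrete, here’s a quick numerical check on two arbitrary three-class distributions (plain Python; the particular numbers don’t matter):

```python
import math

def kl(a, b):
    """KL-divergence between two categorical distributions a and b."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))

p = [0.9, 0.05, 0.05]
q = [0.4, 0.3, 0.3]

print(kl(p, q))   # ~0.55
print(kl(q, p))   # ~0.75 -- not the same: D_KL(p||q) != D_KL(q||p)
```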

Then there’s that issue of the impossibility of full integration, and the need to use sampling to try to approximate that integral that represents how good the classifier is over its entire input space. When we “integrate” the loss function over the training set during optimisation, do we believe that this accurately represents the integral over the whole input domain? Of course we don’t. Test accuracies are generally lower than training accuracies, and everyone knows this. The estimated loss integral is an understatement of the “true” one: it does not accurately generalise.⁶

A basic example of training curves demonstrating a network trained to the point of overfitting (taken from here). Because the training framework is fundamentally about matching a given sample set, efforts to force training error below a certain point will inevitably lead to overfitting to that set, assuming sufficient expressivity of the network (a justified assumption). This just illustrates the gap between the losses on the known training and validation sets, not to speak of what can happen on other unknown sets.

As we’ve said above, the common framing of supervised machine learning involves matching a target output distribution at some sampled input points (subject to some regularisation, most of which is in practice implicit). The learned classifier never models the “true” distribution (whatever that even is): it models the set of samples it’s actually been given, and extrapolates its behaviour away from them however it happens to. So we know from the outset that our sampling isn’t truly adequate, and that if we ran different models on different test sets, we’d expect different results each time. This is already a problem when it comes to comparing model behaviour.

The adversarial perspective takes this point even further. It shows us that it’s not just that things would change if we picked a different set of samples in image space: it’s that they would change a lot if we picked a different set of very nearby examples in image space. And one of our core assertions here is this: if the outputs of one classifier can be made to match those of another classifier just by feeding one of them slightly perturbed versions of the other’s inputs, then you could consider those classifiers to be fundamentally similar to one another. Further, if those slight perturbations to the inputs of one classifier could be predicted using the other classifier, then you should consider those two classifiers to be very fundamentally similar to one another. After all, they locate crucial decision boundaries in approximately the same places, and reliably so.

We can’t truly integrate the functions that define classifier behaviour: it just isn’t possible. And we’ve already talked about some of the issues with the approximations we have to use. But we can at least move on from evaluating their behaviour only at a specific list of (often predetermined) points, and start talking about their behaviour at and around those points. That is, we can move from a zeroth-order evaluation to more of a first-order one.⁷

And in doing so, we can more concretely make the point that developments in classification technology shouldn’t only yield marginal differences which essentially represent re-tunings of the thresholds being used to split statistical hairs. They should yield fundamentally different, and more robust, perspectives on what characterises their inputs. Now, we’re going to move on to show you how to construct an actual algorithm to reveal the extent to which that is or isn’t happening, between two given classification functions.

Building a Simple and Effective Transfer Attack

Let’s pause and take stock of what we know at this point. We understand the relationship between features and adversarial examples. And because of this, we believe that if two networks are very similar, then we should be able to use the features from one to attack the other, easily. The converse is true too: if having one of the nets makes attacking the other net easy over an entire input set, then that demonstrates that the two nets are very similar to each other in a crucial sense.

Great. Let’s start attacking/demonstrating network similarity. We’re going to do this within the framework of a “surrogate- and score-based black-box attack”. That is, we’re going to attack one net (the target) which gives us nothing but its output scores, using another network (the surrogate) which we have full access to.

Let’s try to make the algorithm as simple as we can, but no simpler. We’ll take the approach of starting with something basically conceptually solid but too simple, and adding to it until it works as well as we want it to.

As we said earlier, as soon as adversarial examples were discovered in visual classification, people tried to use examples from some nets on other nets directly, to see whether they’d “just work”. And they sometimes did! Based on everything we’ve talked about above, that shouldn’t be surprising at all. If anything, it’d be surprising had that not worked: the different network flavours were similar, and this was a straightforward feature transfer between them. We might visualise a successful transfer of that sort like this:

At left and right are two different classifiers that have been trained on the same dog/ostrich task we used as an example in our previous post. There are incidental differences between these classifiers, which can be seen in their slightly different decision boundaries. But they fundamentally respond to the same task in the same way, which includes the way the right classifier sees the adversarial image I + ∆I that was originally derived on the left classifier.

We’re happy to start with this basic approach, as it captures the essence of our own thinking. Let’s put down the pseudocode of a very simple algorithm which just takes a single big step (however large is permitted) in the direction of the surrogate gradient and sees whether that works as an adversary on the target. (When we talk about “gradients” here, we mean of any adversarial loss function normally used when computing adversarial examples. The difference between the scores of the ground-truth class and a spurious target class is a good and common choice.) If we have more than one surrogate, we pick from the set randomly until one works. This is just a “Fast Gradient Method” transfer attack:
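As a rough illustration (a sketch in PyTorch, not the paper’s actual pseudocode), that single-step transfer might look something like this. Here, adv_loss, target_scores, surrogates and the norm bound nu are placeholder names: target_scores is the black-box score oracle, and surrogates is the list of white-box nets we control.

```python
import random
import torch

def adv_loss(scores, y_true, y_target):
    """Adversarial loss: score of the ground-truth class minus score of a spurious target class.
    Driving this down pushes the prediction away from y_true (and towards y_target)."""
    return scores[0, y_true] - scores[0, y_target]

def fgm_transfer(x, y_true, y_target, surrogates, target_scores, nu):
    """Single-step ('fast gradient') transfer: one maximal step along a surrogate's gradient,
    checked once against the black-box target. All names here are illustrative."""
    for surrogate in random.sample(surrogates, len(surrogates)):   # pick surrogates in random order
        x_in = x.clone().detach().requires_grad_(True)
        adv_loss(surrogate(x_in), y_true, y_target).backward()
        g = x_in.grad
        candidate = (x - nu * g / g.norm()).clamp(0, 1)            # go as far as the norm budget allows
        if target_scores(candidate).argmax() != y_true:            # one query: is the target fooled?
            return candidate
    return None   # no surrogate's single step transferred
```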

This alone will work some of the time, but only some of the time⁸. Do the failures indicate profound network difference, then? No. Consider the following simplified 2D example:

In this case, the curved decision boundary learned by the left classifier is slightly different from that learned by the one on the right. Because of this, not only does I + ∆I not work as an adversarial example on the classifier on the right, neither does any rescaled version I + s∆I of it. The direction ∆I/||∆I|| must itself be adjusted, though only slightly. (If you’re wondering how realistic this depiction of a curved decision boundary is, read this.)

One of the key issues here is that this optimiser is extremely crude. There’s no iteration to correct the direction estimate at all. That’s not really how we optimise anything, generally. Even in simple first-order approaches, we don’t pick a direction and go for broke. We take a more moderate step, then recalculate our search direction, and take another. That is, we do iterative gradient descent. And if we want to keep our total perturbation size within a limit (as we often do with adversarial examples), then we project our perturbation back onto our constraint ball. That is, we do “PGD”: projected gradient descent.

So let’s update our algorithm accordingly:

The highlighted bits show what we’ve added compared to the fast-gradient transfer of Algorithm 1, to implement PGD transfer instead. Basically, we’re just taking smaller steps of length ε each time, while keeping ourselves projected (by operator Π) back to the norm bound ν. Also, each surrogate in the set gets a chance to generate a direction from each intermediate step if nothing has worked yet, so we reset our set each time.
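In the same illustrative style (reusing adv_loss and the imports from the sketch above), a PGD-style transfer loop might look roughly like this; eps, nu, max_queries and project_l2 are again placeholder names rather than the paper’s exact settings:

```python
def project_l2(x_adv, x, nu):
    """Operator Pi: project x_adv back onto the L2 ball of radius nu around the clean image x."""
    delta = x_adv - x
    if delta.norm() > nu:
        delta = delta * (nu / delta.norm())
    return x + delta

def pgd_transfer(x, y_true, y_target, surrogates, target_scores, nu, eps, max_queries=500):
    """Iterative transfer: repeated surrogate-gradient steps of length eps, each projected
    back to the norm bound, keeping only steps that improve the target's (black-box) loss."""
    x_adv = x.clone()
    best_loss = adv_loss(target_scores(x_adv), y_true, y_target)   # one query to start
    for _ in range(max_queries):
        # The whole surrogate set gets a (reshuffled) chance to propose a direction from this point.
        for surrogate in random.sample(surrogates, len(surrogates)):
            x_in = x_adv.clone().detach().requires_grad_(True)
            adv_loss(surrogate(x_in), y_true, y_target).backward()
            g = x_in.grad
            candidate = project_l2(x_adv - eps * g / g.norm(), x, nu).clamp(0, 1)
            scores = target_scores(candidate)                      # one black-box query
            if scores.argmax() != y_true:
                return candidate                                   # the target is fooled
            loss = adv_loss(scores, y_true, y_target)
            if loss < best_loss:                                   # the step helped on the target
                x_adv, best_loss = candidate, loss
                break                                              # recompute gradients from the new point
    return None
```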

So again, we have a pretty standard gradient-descent algorithm at this point, but with the twist that the gradient directions actually come from the surrogate network instead of the target itself. As we’ll see shortly, this already works really well most of the time: that little bit of added “optimiser sanity” gets us a long way, compared to just trying to port an adversarial example directly.

But this algorithm still suffers from a limitation that may prove fatal in some cases. At any given step, it only has a single candidate direction per surrogate: the surrogate’s gradient. If there’s a single attempted step that doesn’t improve the target objective for whatever reason, then the optimiser is just stuck. And this can and does happen: after all, we’re claiming that these functions are very similar, not identical. From time to time, at a certain place, the two nets might not quite respond to the same feature in exactly the same way. And we’re working with local linear approximations of nonlinear functions to begin with, which is another potential point of optimisation failure.

We want to stick to this solid core algorithm, and the solid core intuition of straightforward feature transfer between nets, as closely as we can. We could work harder on this simple optimiser, e.g. by making the step length dynamic, but we’re not even going to make the implementation that complicated. We just need to introduce a little bit of the right sort of variety to keep things moving.

We’ll do this in two ways. The first one is really simple: in addition to trying a forward step along the surrogate gradient direction, we’ll also try a backward one⁹. If that doesn’t work, we’ll use the surrogate to propose a direction other than its own loss gradient, but which still represents a feature or blend of features that the surrogate finds significant. This is still all about feature transfer, after all, and if something makes our surrogate react, we suspect it’s going to make our target react too. Crucially, we don’t want to start blindly searching image space, because that will likely take ages. So instead, we’ll search in the local space of directions that change the surrogate’s output distribution in some way. Those directions are the class-score gradients. When stacked into a matrix, they’re called the Jacobian, and the space they span is called the coimage.

A very simple example of the coimage of a 2D->1D linear map F. The only thing that determines where in the 1D output space (the image) an input vector v is mapped is its component in a 1D subspace of input space (the coimage). Any component of the input in the subspace orthogonal to the coimage (the kernel) is mapped to zero. Note that in the case that the image is the space of output class scores and the coimage is the subspace of image space that effects any change in those scores, searching in the kernel is a complete waste of time if the intent is to change a class score.

We could search that space in different ways, but we’re keeping it simple and lazy, so we’re just going to do it randomly. (This is called “ODS”, but it’s just one way of doing it.) Conceptually, it looks like this:

Our final algorithm. Ideally, the target’s own gradients (“true”, at right) would be used for gradient descent, but those aren’t accessible. So the surrogate suggests either its own gradient (“sur”) or a coimage sample (“co”). The method sticks to surrogate gradients whenever possible, which is basically just a transferred PGD attack. This is what happens most of the time. You can see the sequence of steps the optimiser has taken so far on the right side of the image.
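In code, that coimage sampling (ODS) can be as small as the following sketch, reusing torch from the earlier snippets; num_classes is just the illustrative ImageNet value:

```python
def ods_direction(surrogate, x_adv, num_classes=1000):
    """Sample a random direction from the surrogate's coimage (ODS-style): take a random
    weighting of the class scores and differentiate it with respect to the input.
    The result is J^T w for random w, i.e. it lies in the span of the class-score gradients."""
    x_in = x_adv.clone().detach().requires_grad_(True)
    w = torch.empty(num_classes).uniform_(-1.0, 1.0)   # random weights on the output classes
    (surrogate(x_in)[0] * w).sum().backward()          # d(w . scores)/d(input)
    d = x_in.grad
    return d / d.norm()                                # a unit direction within the coimage
```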

And so, without too much work, we end up with our algo, “GFCS” (for “Gradient First, Coimage Second”):

To get to the final GFCS algorithm, we just add the highlighted bits to the PGD transfer version. Instead of failing when we run out of gradient transfer directions, we instead revert to a block which implements ODS, given by the formula towards the right. It’s a simple method for randomly sampling directions from the coimage. (The blue highlight shows the addition of the forward/backward SimBA approach. Don’t worry about it too much.)
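Putting the pieces above together, a sketch of the full GFCS loop might read as follows. It reuses adv_loss, project_l2, ods_direction and the imports from the earlier snippets; the names, defaults and bookkeeping are illustrative, and the paper’s actual implementation differs in its details.

```python
def surrogate_gradient(surrogate, x_adv, y_true, y_target):
    """The surrogate's own adversarial-loss gradient at the current point, as a unit descent direction."""
    x_in = x_adv.clone().detach().requires_grad_(True)
    adv_loss(surrogate(x_in), y_true, y_target).backward()
    return -x_in.grad / x_in.grad.norm()

def gfcs(x, y_true, y_target, surrogates, target_scores, nu, eps, max_queries=1000):
    """'Gradient First, Coimage Second': transferred PGD on surrogate loss gradients, falling
    back to random coimage (ODS) directions only when no gradient proposal improves the
    target objective. Also returns the number of black-box queries spent."""
    x_adv = x.clone()
    best_loss = adv_loss(target_scores(x_adv), y_true, y_target)
    queries = 1

    def try_direction(d):
        """SimBA-style: try a forward then a backward step of length eps along d.
        Returns (point, loss, fooled) if the target objective improved or flipped, else None."""
        nonlocal queries
        for sign in (+1.0, -1.0):
            candidate = project_l2(x_adv + sign * eps * d, x, nu).clamp(0, 1)
            scores = target_scores(candidate)            # one black-box query
            queries += 1
            loss = adv_loss(scores, y_true, y_target)
            fooled = bool(scores.argmax() != y_true)
            if fooled or loss < best_loss:
                return candidate, loss, fooled
        return None

    while queries < max_queries:
        result = None
        # Gradient First: each surrogate's own loss gradient, in a random order.
        for s in random.sample(surrogates, len(surrogates)):
            result = try_direction(surrogate_gradient(s, x_adv, y_true, y_target))
            if result:
                break
        # Coimage Second: random ODS directions, only if no gradient step helped.
        while result is None and queries < max_queries:
            result = try_direction(ods_direction(random.choice(surrogates), x_adv))
        if result is None:
            break
        x_adv, best_loss, fooled = result
        if fooled:
            return x_adv, queries
    return None, queries
```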

That’s it. Let’s run it.

How Well This Works

As before, there are other transfer/surrogate-based attacks out there. And there are other transfer attacks that adopt the same setting we have: access to the target’s output scores, and full access to a surrogate (or a set of them). We collected the best of those approaches and ran them against the algorithm we just developed above, using the same experimental setup and the same networks they used. The comparison is based on (a) how many times the target needs to be queried before the attack succeeds (fewer is better), and (b) what fraction of the input set the attack works on within a fixed query limit (higher is better).

Here’s the main result, as a table of median query counts and success rates. Note that even with success rates that are nearly perfect, query counts are very low:

The black-box ImageNet networks being attacked are given at the top of each column. The 1-surrogate trial uses ResNet-152, while the 4-surrogate one alternates between VGG-19, ResNet-34, DenseNet-121, and MobileNet-v2 (to replicate the experiments from the competing papers). LeBA in particular is a much more complicated method which involves, among other things, trying to learn the surrogate. The “GF, no CS (ablation)” rows show what happens if you use gradient transfer only, without the coimage backup: the main effect is to increase the failure rate, but with the 4-surrogate set, even that reduction in success is modest. For full plots of success rates against queries per image, see Figure 2 in the paper.

How were we able to achieve results like these so easily? We’d explain it like this:

  1. We understand that adversarial examples are features and vice versa.
  2. We basically believe in the universality hypothesis, and recognise the empirical results indicating that nets lean on those universal features in similar ways. Their end-to-end behaviours can therefore be expected to be similar in a predictable and accessible way.
  3. We understand that “similar” does not mean “identical”, and there are always going to be incidental differences. The optimiser doesn’t need to be fancy, just fit for purpose.

One more thing: we mentioned that that “backup” coimage search block was our way of “unsticking” the optimiser, and that this was required to prevent failure. But how much failure? How far do we get using only that really simple gradient transfer block, which is all the algorithm actually does when it isn’t unsticking itself? There are two lines in the table that answer that question: the ones listing “GF, no CS (ablation)” as their method.

To get the high success rates that the full method enjoys, yes, some unsticking is required. This is especially important in the case of using a single network as the surrogate. But even then, basic “transferred gradient descent” often works just fine on its own. And once we get some more variety in the suggested gradients by using a small set of surrogates, the accuracy of the GF-only version goes way up. Just as we’d have expected.

That’s pretty much it. More results and details are of course in our paper, “Attacking deep networks with surrogate-based adversarial black-box methods is easy”.

Recap (and Some Closing Thoughts)

Boiling it all down, what we’ve done here is pretty simple:

  1. We tested our belief that different classification nets were doing very similar things to one another, within an adversarial attack framework.
  2. We did this by simply using one of them to supply the directions that predicted how the other one would react.
  3. The fact that this worked and worked really well vindicated our belief in similarity, since our algorithm bet the farm on it and couldn’t have worked had it been wrong. It’s literally incapable of searching outside the span of the surrogate’s class-score gradients, and it almost always just uses the loss gradient directly.
  4. Throughout, we were guided by our understanding that adversarial examples are just features, and that a well-designed transfer attack is ultimately just a test of similarity between the surrogate and the target.

In closing, we’ve been talking a lot about “network similarity” and “universality” here, but there’s a thought we want to leave you with: none of this is inevitable. It’s not that the phenomenon we’re demonstrating is inherently true of different networks. It’s that it’s empirically true of many. We do believe that many variations on network architecture and training regimes represent minor variations on what is ultimately the same technology, and that this incrementalism comes through in the way those “different” networks actually behave. But we still believe that there are more fundamentally different technologies to be uncovered. Our goal is to find them, and what we’re describing here is part of a framework that we hope everyone will use to evaluate how much progress towards that goal has actually been made.

Thanks for reading!

Footnotes

[1] Strictly speaking, adversarial examples and adversarial perturbations aren’t quite the same thing. By “example”, we’re typically referring to the sum of the perturbation ΔX with its source image X, i.e. X + ΔX, which will then be fed to the net. The perturbation ΔX is the “feature” being added, though there will typically be some features recognised by the net in the “clean” input X as well, in much less “pure” form (as this is how classification normally takes place). We may speak of the adversarial example X + ΔX representing a feature so as to avoid having to deal with this subtlety, but this distinction exists for those who are interested.

[2] You could get the impression from the literature that there’s some sort of difference between “transfer attacks” and “surrogate-based attacks”, because of a terminological split. There really is no meaningful difference between them.

[3] VC dimension et al. See the reference for the discussion. Just counting parameters is perhaps even worse.

[4] Take the entire class of p-norms, where 1 ≤ p ≤ ∞.
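Explicitly, for a vector x:

$$ \|x\|_p = \Big(\sum_i |x_i|^p\Big)^{1/p}, \qquad \|x\|_\infty = \max_i |x_i|. $$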

[5] To see how optimising the log loss in the case of one-hot ground-truth labelling is a special case of optimising the KL-divergence, begin with the full expression of the problem:
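$$ \min_{q}\;\sum_{i} D_{\mathrm{KL}}\!\left(p(\cdot \mid x_i)\,\big\|\,q(\cdot \mid x_i)\right) \;=\; \min_{q}\;\sum_{i}\sum_{c_j \in C} p(c_j \mid x_i)\,\log\frac{p(c_j \mid x_i)}{q(c_j \mid x_i)} $$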

Note that since the KL-divergence isn’t symmetric with respect to p and q, it matters which distribution is which. To get the log-loss expression, p will be taken to be the ground-truth label distribution.

Expanding the log quotient gives us two terms, where one of the terms (the entropy) doesn’t depend on the distribution being optimised at all. So, we can write the whole problem in terms of the other term (the cross-entropy):
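$$ \min_{q}\;\sum_{i}\left[-\sum_{c_j \in C} p(c_j \mid x_i)\,\log q(c_j \mid x_i)\right] $$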

In the special case where the distribution p(cⱼ|xᵢ) is 1 where cⱼ = cᴳᵀ and 0 elsewhere (i.e. one-hot labelling), that expression then further reduces to the common log loss:
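$$ \min_{q}\;\sum_{i}\left[-\log q(c^{\mathrm{GT}} \mid x_i)\right] $$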

[6] This is the point at which some people will begin to use terms like “distribution shift”. That is, they will assert that the issue is that “the distribution” has “changed” in going from the training set to the test set, accounting for the classifier’s worsened performance on the latter. But this is an inadequate explanation of the more fundamental issue that we are talking about here. If “distribution shift” were the true and full issue, then given a predefined train and test set, one could overcome it completely just by mixing the two sets and then resampling the split from the mixed distribution. The retrained classifier would then work perfectly on both, as there would be no “shift”. This, however, will not actually work, because the truth is that the generalisation gap is down to a form of overfitting of an undersampled version of reality. (Go ahead and try it.)

[7] This is the point at which someone might ask, “Well then why don’t you do something simple like take the dot products of the respective gradients?” But it isn’t that simple. For one, there is the issue of obfuscated gradients: extremely different analytical first-order properties do not actually imply meaningfully different zeroth-order properties over a region (hint: noise). We are looking for a more regularised comparison of that information. But even setting aside that issue, remember that we are dealing with nonlinear functions (even if they are often very well modelled by local linear models). Consider what gradients look like in the vicinity of the peak of a hill function, and think about the way their directions change w.r.t. tiny perturbations near that peak. We are ultimately interested in shared effects, not identical gradients: the gradients are a tool in a larger, more reliable analysis.

[8] See Table 2 in this paper for the original experiment of this type.

[9] Something similar to this was popularised in the paper that proposed the SimBA attack. Now, why this would be a helpful thing to try in this case might be counterintuitive: see this paper for a discussion of why this can and does often happen, as well as discussion of so much more.

Acknowledgements

Thanks to John Redford and Hossein Bahari for helpful comments on earlier drafts of this blog that made it much better than it would’ve been otherwise.

Thanks as well to authors whose images have been reproduced here for purposes of commentary, as noted in the corresponding captions. Besides these, some well known images have been reproduced from this paper, as in the previous blog entry. The canonical network image used in the architecture diagram was originally derived from this package. We are once again pleased to collaborate with Struthio molybdophanes and Smudge the Cat.
