Are all CNNs created equal?

Irrespective of architecture, CNNs recognise objects with similar strategies, but humans use a very different one.

Robert Geirhos
Towards Data Science

--

We know so much and yet so little about convolutional neural networks (CNNs). We have access to every single model parameter, we can inspect every pixel of their training data, and we know exactly how the architecture is defined — yet understanding their strategies for recognising objects has proven surprisingly challenging.

Understanding the strategy is crucial: We can only trust CNNs to recognise cancer from X-ray scans or to steer autonomous vehicles if we understand how CNNs make decisions — that is, which strategy they are using.

We here introduce error consistency, a simple analysis to measure whether two systems — for example two CNNs, or a CNN and a person — implement different strategies. Using this analysis, we investigate the following questions:

  • Are all CNNs “created equal” (implementing similar strategies)?
  • Does the strategy of a recurrent CNN differ from a feedforward one?
  • Do humans use the same strategy as CNNs?

Motivation: erring is telling

So you want to train a neural network to distinguish puppies from people. Maybe you’d like to train a system that opens the door when your little puppy arrives but keeps strangers out, or you are the owner of an animal farm where you want to make sure that only people can get into the house.

In any case, you take your favourite CNNs (say, ResNet-152 for performance and AlexNet for good old times’ sake) and train them on a puppies-vs-people dataset scraped from the web. You are relieved to see that each of them reaches about 96–98% accuracy. Lovely. But does similar accuracy imply a similar strategy?

Well, not necessarily: even very different strategies can lead to very similar accuracies. However, those 2–4% that the networks got wrong carry a lot of information about their respective strategies. Suppose AlexNet and ResNet both made an error on the following image by predicting “person” instead of “puppy”:

Photo by Charles Deluvio on Unsplash

Now making this error would be just as sad as this dog looks, especially since many other puppies were recognised perfectly well, like these here:

You probably know what puppies look like, but aren’t they cute?! Photo by Bharathi Kannan on Unsplash.

Looking at some more images that the networks failed to recognise, you start forming a suspicion: Could it be the case that both networks implemented the classification strategy “whatever wears clothes is a person”? You’ve always suspected AlexNet to be a bit of a cheat — but what about you, ResNet? Is it too much to ask for a bit more depth of character from someone with 152 layers?

Closer inspection confirms the suspicion: conversely, the networks also misclassified a few people as “puppies” — we leave it to the reader’s imagination what these images may have looked like, given a decision strategy that relies on the degree of clothing.

Erring is telling, and we can exploit this property: if two systems (e.g. two different CNNs) implement a similar strategy, they should make errors on the same individual input images — not just a similar number of errors (as measured by accuracy), but errors on the very same inputs: similar strategies make similar errors. And this is exactly what we can measure using error consistency.

Introduced in our recent paper, trial-by-trial error consistency assesses whether two systems systematically make errors on the same inputs (or trials, as this would be called in psychological experiments). Call it an analysis based on trial and error if you like.
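For readers who would like to play with the idea, here is a minimal sketch of how error consistency can be computed from two vectors of binary decisions (an illustrative NumPy implementation, not the reference code from the paper): error consistency is Cohen’s kappa, i.e. the observed fraction of trials on which both systems agree (both correct or both wrong), corrected for the agreement expected by chance given the two accuracies.

```python
import numpy as np

def error_consistency(correct_a, correct_b):
    """Trial-by-trial error consistency (Cohen's kappa) between two systems.

    correct_a, correct_b: boolean arrays, True where the respective system
    classified the corresponding image (trial) correctly.
    """
    correct_a = np.asarray(correct_a, dtype=bool)
    correct_b = np.asarray(correct_b, dtype=bool)
    assert correct_a.shape == correct_b.shape

    p_a = correct_a.mean()  # accuracy of system A
    p_b = correct_b.mean()  # accuracy of system B

    # observed consistency: fraction of trials where both systems are
    # either correct together or wrong together
    c_obs = np.mean(correct_a == correct_b)

    # consistency expected by chance for two independent systems
    # with these accuracies
    c_exp = p_a * p_b + (1 - p_a) * (1 - p_b)

    # Cohen's kappa: 0.0 = no more overlap than expected by chance,
    # 1.0 = identical decisions on every trial
    return (c_obs - c_exp) / (1 - c_exp)
```

Positive values indicate that two systems agree on more individual trials than their accuracies alone would predict — this is the quantity we report below.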

Leaving the hypothetical toy dataset (puppies vs. people) aside, how similar are the strategies of different CNNs trained on a big dataset (ImageNet)? And are they similar to human errors on the same data? We simply went to the animal, pardon, model farm and evaluated all ImageNet-trained PyTorch models to obtain their classification decisions (correct responses vs. errors) on a dataset where we also have human decisions for comparison. Here’s what we’ve discovered.
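For concreteness, the evaluation boils down to recording, for every image, whether each model’s top-1 prediction was correct. The snippet below is a rough sketch of such a loop using torchvision (the dataset path is a placeholder, and it assumes the folder’s class indices line up with the ImageNet labels; our actual experiments used the dataset for which we also have human decisions):

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from torch.utils.data import DataLoader
from torchvision.datasets import ImageFolder

# standard ImageNet preprocessing
preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
# placeholder path; assumes class folders are ordered like the ImageNet labels
dataset = ImageFolder("path/to/evaluation_images", transform=preprocess)
loader = DataLoader(dataset, batch_size=64, shuffle=False)

decisions = {}  # model name -> list of per-image correct/error decisions
for name in ["alexnet", "resnet152", "densenet121"]:
    model = getattr(models, name)(pretrained=True).eval()
    correct = []
    with torch.no_grad():
        for images, labels in loader:
            preds = model(images).argmax(dim=1)
            correct.extend((preds == labels).tolist())
    decisions[name] = correct
```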

All CNNs are created equal

Looking at the plot below, one can see that the 16 different CNNs have a broad range of ImageNet accuracies, ranging from about 78% (AlexNet, brown, on the left) to about 94% (ResNet-152, dark blue, on the right). If two systems make just as many identical errors as we would expect by chance alone, we would see an error consistency of 0.0; higher error consistency is an indication of similar strategies (up to 1.0 for identical strategies).

CNNs are similar to other CNNs — but not to humans: higher error consistency (up to k=1.0) is an indication of similar strategies. Image by author.
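As a quick sanity check of that chance baseline, we can feed the `error_consistency` sketch from above with two simulated systems whose errors fall on independent, random trials; their consistency should hover around zero:

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 100_000

# two simulated systems, correct on 90% and 95% of trials respectively,
# with errors landing on independent trials
correct_a = rng.random(n_trials) < 0.90
correct_b = rng.random(n_trials) < 0.95

print(error_consistency(correct_a, correct_b))  # ~0.0: chance-level overlap
```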

Interestingly, error consistency between CNNs and humans (dashed black line) is pretty close to zero: humans and CNNs are very likely implementing different strategies, and this gap is not closing with higher model performance. Humans vs. humans (red), on the other hand, are fairly consistent: most humans make errors similar to those of other humans. Perhaps most surprisingly, however, CNNs vs. other CNNs (golden) show an exceptionally high consistency: CNNs make very similar errors to other CNNs.

Are all CNNs created equal? Ever since AlexNet was introduced in 2012, we have seen tremendous advances in neural network architectures. We now have skip connections, hundreds of layers, batch normalisation, and much more: but error consistency analysis suggests that what we’ve achieved is an improvement in accuracy, not a change in strategy.

The highest error consistency we recorded even occurs for two very different networks: ResNet-18 vs. DenseNet-121, two models from different model families, with different depths (18 vs. 121 layers) and different connectivity. Largely irrespective of architecture, the investigated CNNs all seem to implement a very similar strategy — one that differs from the human strategy.

… but are some CNNs more equal than others?

You may point out that we have only tested feedforward networks, which seems insufficient given that the human brain is well known to rely on an abundance of recurrent computations. Surely we can’t generalise these findings to all CNNs, including recurrent ones?

And that’s certainly true: we can only make definite statements about the CNNs we tested. In order to find out whether a recurrent network is different (“more equal than others”), we analysed a recurrent CNN as well. And not just any recurrent network: CORnet-S, the world’s most brain-like neural network model according to the Brain-Score website; CORnet-S, published as an oral contribution at NeurIPS 2019; CORnet-S, “the current best model of the primate ventral visual stream” performing “brain-like object recognition” according to its authors.

CORnet-S has four layers that are named after brain areas V1, V2, V4 and IT. Figure credit: https://papers.nips.cc/paper/9441-brain-like-object-recognition-with-high-performing-shallow-recurrent-anns.pdf (cropped from https://github.com/dicarlolab/neurips2019/blob/master/figures/fig1.pdf)

CORnet-S has four layers termed “V1”, “V2”, “V4” and “IT” after the corresponding ventral stream brain areas responsible for object recognition in primates, including humans. When comparing CORnet-S against a baseline model, feedforward ResNet-50, this is what we find:

Recurrent CORnet-S (orange) makes similar errors as feedforward ResNet-50 (blue), but not human-like errors (red). Many orange and blue datapoints even overlap exactly. The x-axis is chosen such that we can visualise the grey area in which we would expect models to lie if they only show chance-level consistency.

Now this is curious. Humans make similar errors to other humans (red), but recurrent CORnet-S (orange) makes almost exactly the same errors as feedforward ResNet-50 (blue): the two networks seem to implement a very similar strategy, but certainly not a “human-like” one according to error consistency analysis. (Note, however, that it really depends on the dataset and metric: CORnet-S shows promising results in capturing recurrent dynamics of biological object recognition, for example.) It seems that recurrent computations — which appear to be of particular importance in challenging tasks — are no silver bullet. While recurrence is often argued to be one of the key missing ingredients in standard CNNs towards a better account of biological vision, a recurrent network does not necessarily lead to a different behavioural strategy compared to a purely feedforward CNN.

Summary

  • All CNNs are created equal: irrespective of architecture, all sixteen investigated CNNs make very similar errors.
  • No CNN is more equal than others: even recurrent CORnet-S, termed the “current best model of the primate ventral visual stream”, behaves like a standard feedforward ResNet-50 according to error consistency analysis (but both networks fail to make human-like errors).
  • Humans are created differently. The strategies used by human and machine vision are still very different.

Quantifying behavioural differences might ultimately be a useful guide for narrowing the gap between human and machine strategies, such that some day we may reach the point where, in the wise (and only ever so slightly edited) words of George Orwell, the following has become a reality:

They looked from model to man, and from man to model, and from model to man again; but it was impossible to say which was which …
