Using softmax carefully
In deep learning, softmax is a very common and important function, especially in multi-class image recognition, where it first proved itself in ImageNet classifiers. It maps the network's raw outputs (logits) to numbers between 0 and 1 and normalizes them so that they sum to 1, so the result can be read as a probability distribution over the classes.
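Concretely, softmax(x_i) = exp(x_i) / sum_j exp(x_j). Here is a minimal sketch of the function; the class outputs are made up for illustration:

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability; it doesn't change the result.
    exps = np.exp(logits - np.max(logits))
    return exps / exps.sum()

# Made-up raw outputs for five classes: cat, dog, plane, fish, building
logits = np.array([0.02, -2.49, -1.75, 2.07, 1.25])
probs = softmax(logits)
print(probs)        # every value is between 0 and 1; fish gets the largest share
print(probs.sum())  # 1.0
```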
Let's take an example with two images. In Fig 1, there are five categories. The raw outputs for both fish and building are positive, so from the outputs alone it's difficult to say what the image is. We therefore use softmax to turn the outputs into class probabilities; fish gets the highest softmax value, so the network classifies the image as fish.
In Fig 2, only the fish output for Image 2 is positive, so it seems clear that Image 2 is a fish. But when we compute the softmax values, Image 2's fish probability comes out the same as Image 1's, even though the raw outputs of the two images are quite different.
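That happens because softmax only sees the relative differences between the outputs: adding or subtracting the same constant from every output leaves the probabilities unchanged. A minimal sketch of this, using made-up outputs rather than the figures' actual numbers:

```python
import numpy as np

def softmax(logits):
    exps = np.exp(logits - np.max(logits))
    return exps / exps.sum()

# Made-up outputs: image2 is image1 with every output shifted down by 1.8,
# so only fish stays positive, yet the probabilities come out identical.
image1 = np.array([0.5, -1.0, -0.5, 2.0, 1.5])
image2 = image1 - 1.8

print(softmax(image1))
print(np.allclose(softmax(image1), softmax(image2)))  # True
```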
So for both images, softmax will tell you with high probability that the image contains one of the classes, even if it contains several of them, or none of them; it just picks one and reports that it's pretty sure. Softmax is a terrible idea for image recognition unless you know that every one of your images contains exactly one of the classes.
So softmax only works under these strict assumptions:
- each image contains exactly one object
- that object must belong to one of the known classes
When those assumptions don't hold, there are two common ways to deal with image recognition.
1. Create another category called background, null, or missing. In our example there would then be six categories: cat, dog, plane, fish, building, or missing.
A lot of researchers do this, but it often doesn't work. The reason is that to predict a missing category, the penultimate-layer activations would have to contain features that identify it. For a real class, there are features that clearly distinguish it; but there is no set of features that, when they are all high, clearly means "not a cat, dog, plane, fish, or building". "None of the above" isn't a kind of object, so it's hard to learn what an image is not. Plenty of well-regarded academic papers make this mistake.
2. Use a binomial (sigmoid) output per class, which computes exp(x)/(1+exp(x)). For a single class, this is exactly the same as a two-way softmax over "has the thing" and "doesn't have the thing". In Fig 3 you can see how the numbers differ from the softmax values: Image 1 looks like it might contain a cat, a fish, and a building, while Image 2 maybe contains a fish.
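A minimal sketch of these per-class binomial scores, again with made-up outputs standing in for Fig 3's actual numbers:

```python
import numpy as np

def sigmoid(x):
    # Binomial output: exp(x) / (1 + exp(x)), one independent score per class.
    return 1.0 / (1.0 + np.exp(-x))

# Made-up raw outputs for cat, dog, plane, fish, building
image1 = np.array([0.5, -1.0, -0.5, 2.0, 1.5])
image2 = np.array([-2.5, -4.0, -3.5, 0.8, -1.5])

print(sigmoid(image1))  # cat, fish, and building all score fairly high
print(sigmoid(image2))  # only fish gets a meaningful score; nothing sums to 1
```

Because each class is scored independently, the model can now say "several things are present" or "nothing is confidently present", which softmax cannot.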
Therefore, for image recognition you probably don't need softmax most of the time. If a paper uses softmax, it's worth asking whether the problem actually satisfies softmax's assumptions. Often the answer is no; try replicating the result without softmax, and you may get a better one.
In fact, softmax is a good fit for language modeling: when predicting the next word, there is always at least one word, and exactly one.