Computer Vision Challenges: Why my neural network called me a punching bag…

Vadim Karpusenko · Microsoft Azure
Apr 6, 2020 · 5 min read
a punching bag — 17% confidence

Let me tell you one embarrassing story behind this image. Ok, maybe not “super” embarrassing, but definitely funny, at least for the audience at the time. This picture of me (@Vadi) was taken about two years ago while I was presenting on stage. The idea of our technical demo for the show was very simple: step by step, we were showing how to create a photo-sharing app. Anyone could upload images and photos to our demo website or mobile app, and the audience could vote them up or down. My part of the demo was to grab these images and assign automatic tags to them using pre-trained Machine Learning models. In other words, my tamed AI was telling me what it was seeing in the image, and that was used for tags and labels. It just happened that the last picture uploaded to the demo website right before my tech demo was the one you can observe above. So, I’m on stage showing a live demo where, step by step, I execute Python code doing the image handling, conversion, and finally the prediction. If you want to see exactly what the result was — please go to the demo page, copy-paste the provided link, and press the “Submit” button. You will not regret it! 😆😂🤣 Then continue reading here.

I’ve created an interactive demo, so everyone can test different images and see which of the 1000 classes a MobileNet_V2 pre-trained on ImageNet data will predict for them, and visualize the heatmap of the Neural Network’s attention, known in the scientific literature as the CAM method (Class Activation Mapping). Feel free to test it out here: https://aka.ms/CVattention

Yeppp! After executing the code cell with predictions, I got “PUNCHING BAG” printed out in bold on the projector screen behind me. The audience EXPLODED IN LAUGHTER!!! 😂😜

So, after a little bit of a shock (well, you don’t get called a ‘punching bag’ every day, especially by the Artificial Intelligence you kinda created…), I decided to take a look at why this label showed up in the first place. Although the confidence for this prediction was relatively low (only ~17%), we could still trace which neurons were activated in the network to produce this result. After visualizing the averaged activation intensity in the last convolutional layer (picture above, with the “attention” heatmap highlighted in green), it turned out that the shape of the shades in the background on stage somewhat resembled a couple of punching bags, and the neural network made its prediction using the visual features located there. 🤓😎🥳

Please go ahead and play with the demo, test different images. If you find something interesting and/or funny — feel free to share it with your friends. You can save the image and add it to your post. The most interesting ones shared with the #AIApril tag will get our shoutout. This will be our Computer Vision Challenge for #AIApril (or Computer Vision challenges… depends on how you look at it… 😂). To see more amazing content, check out the AI April Content Dashboard!

My last piece of advice before we switch to the technical part: don’t be afraid to laugh at yourself! Especially during these difficult times. Stay safe, stay sane, stay healthy.

CAM technique (Class Activation Mapping)

Convolutional Neural Networks (CNNs) are the preferred Machine Learning models for Computer Vision nowadays. The two main operations used in these models are the convolution itself and pooling. The convolution operation finds a particular pattern in the input image (in the first layer) and recognizes combinations of these patterns in the deeper layers. The network adjusts the pattern weights (or filters) during the training phase through a process called backpropagation: the filter weights are changed slightly so that they distinguish better and better between the features (properties) of the different classes the neural network is trained to recognize. Pooling reduces the dimensionality of the processed data by collapsing small squares of pixels (usually 2x2, i.e. 4 pixels) into a single value: either the pixel with the highest intensity (max pooling) or the average of the values (average pooling).
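To make these two building blocks concrete, here is a minimal tf.keras sketch (my illustration, not the demo’s code): one convolutional layer learning 32 pattern filters, followed by a 2x2 max-pooling layer that keeps only the strongest response in each square.

import numpy as np
import tensorflow as tf

block = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, kernel_size=3, padding="same", activation="relu",
                           input_shape=(224, 224, 3)),   # learn 32 pattern filters
    tf.keras.layers.MaxPooling2D(pool_size=2),            # 224x224 -> 112x112
])

dummy = np.random.rand(1, 224, 224, 3).astype("float32")  # a fake RGB image
print(block(dummy).shape)  # (1, 112, 112, 32): half the resolution, 32 channels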

The order of these operations, and the size and number of the convolutional filters, define the CNN (Convolutional Neural Network) topology. The original example I played with while demonstrating the CAM (Class Activation Mapping) technique was using VGG16. But this neural network has a relatively large memory footprint: the file with the trained network weights takes about 560 MB on a hard drive. So, to reduce the start-up time and memory load of the model, I switched to a lighter model called MobileNet_V2, which takes only 14 MB of space.
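Swapping the model is essentially a one-line change in Keras. A minimal sketch, assuming the tf.keras import paths (the demo’s actual code may differ slightly):

from tensorflow.keras.applications import MobileNetV2   # ~14 MB of weights
# from tensorflow.keras.applications import VGG16       # hundreds of MB of weights

# Both constructors download their ImageNet weights on first use
model = MobileNetV2(weights="imagenet")                  # 1000 ImageNet classes
model.summary()                                          # last conv output: 7x7x1280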

All input images are rescaled to a 224x224 resolution for both the VGG and MobileNet CNNs. After one or several convolutional layers, pooling halves the resolution in each dimension (a factor-of-4 reduction in pixel count): 224x224 -> 112x112 -> 56x56 -> 28x28 -> 14x14 (-> 7x7 for MobileNet). At the same time, the number of channels (or filters) increases as the spatial resolution shrinks. In the last convolutional layer the filters process 7x7 image representations, produced by the convolutions and pooling applied to the original 224x224 image. So, if we take the 1280 channels of the last convolutional layer and average them weighted by the class activation weights (the weights connecting each channel to the neuron responsible for the class with the highest inference confidence), we can recreate a 7x7 heatmap of where the filters were activated the most. Overlaying this heatmap, after rescaling, with the original image visualizes where our neural network was “looking” while making the prediction.
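Here is a hedged sketch of that computation for MobileNet_V2 in tf.keras. The layer name “out_relu”, the image file name, and the final upscaling step are my assumptions, not taken from the demo code; check model.summary() if the names differ in your Keras version.

import numpy as np
import tensorflow as tf
from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2, preprocess_input
from tensorflow.keras.preprocessing import image

model = MobileNetV2(weights="imagenet")
last_conv = model.get_layer("out_relu")                  # 7x7x1280 feature maps
cam_model = tf.keras.Model(model.inputs, [last_conv.output, model.output])

img = image.load_img("stage_photo.jpg", target_size=(224, 224))  # hypothetical file
x = preprocess_input(np.expand_dims(image.img_to_array(img), 0))

conv_maps, preds = cam_model.predict(x)
class_idx = int(np.argmax(preds[0]))                     # e.g. “punching bag”

# Weights connecting the 1280 channels to the neuron of the predicted class
class_weights = model.layers[-1].get_weights()[0][:, class_idx]   # shape (1280,)

# Weighted sum of the 7x7 channel maps -> 7x7 class activation map
cam = np.tensordot(conv_maps[0], class_weights, axes=([2], [0]))  # shape (7, 7)
cam = np.maximum(cam, 0)
cam /= cam.max() + 1e-8                                  # normalize to [0, 1]
# Upscale cam to 224x224 (e.g. with PIL) and blend it over the original image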

Demo implementation

The statically hosted HTML page has some VueJS interactive elements. It all relies on Azure Blob Storage functionality. The images are processed by Python code (Keras + Pillow) running in Azure Functions.

Static hosting of this website can be done in a few easy steps described in this tutorial. Basically, you take a regular blob storage account, create a blob container with the special name $web, and the files from there are served statically to anyone with the link.
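For illustration only (not from the tutorial), a rough Python sketch of pushing a page into that $web container with the azure-storage-blob SDK; the connection string and file name are placeholders:

from azure.storage.blob import BlobServiceClient, ContentSettings

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
blob = service.get_blob_client(container="$web", blob="index.html")

with open("index.html", "rb") as f:
    blob.upload_blob(f, overwrite=True,
                     content_settings=ContentSettings(content_type="text/html"))
# The page is then served from the storage account's static website endpoint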

For the Azure Function implementation of the Machine Learning model inference and the heatmap creation with the CAM method, I followed this tutorial. There’s an interesting trick used in the tutorial: if the model is assigned to a global variable during initialization, it remains available to subsequent runs of the same function instance. This saves the time needed to load the model topology and weights into memory on every call.
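A minimal sketch of that trick in a Python Azure Function (the handler body here is illustrative and not the demo’s code):

import logging
import azure.functions as func
from tensorflow.keras.applications import MobileNetV2

model = None   # module-level, so it survives across warm invocations

def main(req: func.HttpRequest) -> func.HttpResponse:
    global model
    if model is None:
        # Runs only on a cold start; later calls reuse the already loaded model
        logging.info("Loading MobileNetV2 weights...")
        model = MobileNetV2(weights="imagenet")
    # ... decode the uploaded image, run model.predict(), build the CAM heatmap ...
    return func.HttpResponse("ok")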

Finally, if you are interested in the Machine Learning model itself, you can check it out on Azure Notebooks or get it from the GitHub repository.

Thank you! I hope you’ll have fun with this technique.


Vadim Karpusenko
Microsoft Azure

HPC researcher, Machine Learning / Deep Learning / AI Developer Evangelist