CNN Series Part 1: How do computers see images?

Shweta Kadam
Published in Analytics Vidhya
7 min read · Jul 7, 2020

In this article, we will learn about how computers see images & the issues faced while performing a computer vision task. We will see how deep learning comes into the picture & how with the power of neural networks, we can build a powerful computer vision system capable of solving extraordinary problems.

One example of how deep learning is transforming computer vision is facial recognition, or face detection. At the top left, the icon of the human eye represents vision entering the deep neural network in the form of images, pixels, and videos; at the bottom, the output is a detection of a human face. The same setup could also recognize different faces, emotions, key facial features, and so on.

Deep learning has transformed this field because the creator of such an AI doesn't need to tailor the algorithm specifically for facial detection. Instead, they can provide lots and lots of data to the algorithm, swap the end piece (the face icon) for facial detection or many other detection or recognition tasks, and the neural network can learn to solve that task.

For example, we can replace the facial detection task with disease detection in the retina of the eye; similar techniques can be applied to the detection and classification of diseases throughout healthcare.

Now that we have a high-level sense of some of the computer vision tasks we as humans solve every day, and that we can train machines to solve for us, the next natural question to ask is:

How can computers see? Specifically, how does a computer process an image or a video? How does it process the pixels coming from those images?

Well, to a computer, images are just numbers.

Suppose we have a picture of a handwritten number 8. It is made of pixels, and since it is a grayscale image, each pixel can be represented by a single number. We can therefore represent our image as a 2-dimensional matrix of numbers, one entry per pixel, and that is exactly how a computer sees the image: as a 2-dimensional matrix of numbers.
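A minimal sketch of this idea in NumPy (the pixel values below are made up for illustration): a grayscale image is nothing more than a 2-D array, where each entry is one pixel's brightness.

```python
import numpy as np

# A hypothetical 6x6 grayscale "image": each entry is a pixel's
# brightness value (0 = black, 255 = white).
image = np.array([
    [  0,   0, 120, 200, 120,   0],
    [  0, 150, 255,  90, 255, 150],
    [  0, 120, 200, 255, 200, 120],
    [  0, 150, 255,  90, 255, 150],
    [  0,   0, 120, 200, 120,   0],
    [  0,   0,   0,   0,   0,   0],
])

print(image.shape)   # (6, 6) -- one number per pixel
print(image[1, 2])   # 255 -- the pixel in row 1, column 2
```

Indexing by row and column gives back a single brightness number, which is all the computer ever "sees" of the picture.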

Images are just numbers

Now, if we have a color image, i.e. an RGB image, instead of a grayscale one, we can represent it as 3 of these 2-dimensional matrices stacked on top of each other, one per channel: Red, Green & Blue, which is what RGB stands for.
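The stacking can be sketched as follows (the channel values are arbitrary, chosen just to show the shapes): three 2-D matrices combined along a third axis give the familiar (height, width, 3) layout.

```python
import numpy as np

height, width = 4, 4

# One 2-D matrix of brightness values per color channel.
red   = np.full((height, width), 200, dtype=np.uint8)
green = np.full((height, width), 50,  dtype=np.uint8)
blue  = np.full((height, width), 10,  dtype=np.uint8)

# Stacking the three channels along a third axis gives the usual
# (height, width, 3) layout used by most image libraries.
rgb = np.stack([red, green, blue], axis=-1)

print(rgb.shape)   # (4, 4, 3)
print(rgb[0, 0])   # [200  50  10] -- the RGB triple of one pixel
```

So a single pixel in a color image is a triple of numbers rather than one number, but it is still just numbers.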

Given this foundation, we can apply the 2 common types of machine learning: classification and regression. In regression, the output takes a continuous value; in classification, the output takes a class label.

So let's start with classification, specifically image classification, where we want to predict a single label for each image.

For example, suppose we have a bunch of handwritten numbers and we want to build a classification pipeline to determine which number is in the image we are looking at, outputting the probability that the image is each of those handwritten numbers.
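A common way to turn a classifier's raw scores into such probabilities is the softmax function. Here is a small sketch with made-up scores for the digits 0 through 9 (the score values are assumptions, standing in for whatever a trained model would produce):

```python
import numpy as np

def softmax(scores):
    """Turn raw class scores into a probability distribution."""
    exps = np.exp(scores - scores.max())  # subtract max for numerical stability
    return exps / exps.sum()

# Hypothetical raw scores for digits 0-9, as a classifier might output.
scores = np.array([1.2, 0.3, 0.8, 0.1, 2.5, 0.0, 0.4, 1.0, 4.1, 0.2])

probs = softmax(scores)
print(probs.argmax())         # 8 -- the most likely digit
print(round(probs.sum(), 6))  # 1.0 -- probabilities sum to one
```

The predicted label is simply the class with the highest probability.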

In order to classify the image correctly, our pipeline needs to understand what is unique about a picture of a handwritten 8 versus a handwritten 4 versus a handwritten 7. It needs to understand those unique differences, or features, in each of those images.

Now another way to think about this image classification pipeline at a high level is in terms of features that are characteristics of a particular class.

Features are characteristics of a particular class.

Let’s identify key features in each image category.

Classification is done by detecting these types of features in each class. If you detect enough features specific to a class, you can say with pretty high confidence that you are looking at that class. In other words, our model needs to know what those features are, and it needs to be able to detect them to generate a prediction.

Now one way to solve this problem is to leverage domain knowledge about your field.

Manual Feature Extraction

Suppose we want to detect human faces. We can leverage our knowledge about human faces: first detect eyes, nose, mouth, ears, etc., and once we have a detection pipeline for those features, we can detect them and determine whether we are looking at a human face or not.
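The shape of such a hand-built pipeline might look like the toy sketch below. The detector functions here are placeholders (pure assumptions), standing in for hand-engineered routines such as edge or template matching:

```python
# A toy sketch of manual feature extraction for face detection.
# Each detector is a stub that would, in a real pipeline, be a
# hand-engineered routine returning how many such features it found.

def detect_eyes(image):
    return 2  # placeholder: pretend we found two eye-like regions

def detect_nose(image):
    return 1  # placeholder

def detect_mouth(image):
    return 1  # placeholder

def looks_like_face(image):
    # Classify as "face" only if enough hand-picked features are present.
    return (detect_eyes(image) >= 2
            and detect_nose(image) >= 1
            and detect_mouth(image) >= 1)

print(looks_like_face(None))  # True for this stub pipeline
```

The hard part, of course, is what the stubs hide: actually writing robust detectors for eyes, noses, and mouths by hand, which is exactly the problem raised next.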

Now there is a big problem with that approach: in the preliminary detection pipeline, how do we detect those ears, eyes, nose, and mouth in the first place? This hierarchy becomes our bottleneck.

Remember that images are just 3-dimensional arrays of numbers (arrays of brightness values), and that images can hold lots and lots of variation, such as illumination conditions, background clutter, and both inter-class and intra-class variation.

We need to be invariant to intra-class variations & sensitive to inter-class variations.

First, let's understand what we mean by intra-class and inter-class variation. Intra-class variations can be divided into two types: intrinsic factors and imaging conditions. With intrinsic factors, each object category can contain many different object instances, possibly varying in color, texture, material, shape, and size. Take the "chair" category: all of the images belong to the chair class, but there is a lot of variation in how a chair looks. Our model needs to understand all of these variations within a single object class.

Intra-class variation: different instances of the "chair" category.

Imaging-condition variations are caused by the dramatic effect unconstrained environments can have on object appearance: lighting (dawn, day, dusk, indoors), physical location, weather, cameras, backgrounds, occlusion, and viewing distance. All of these conditions produce significant variations in object appearance, such as changes in illumination, pose, scale, occlusion, clutter, shading, blur, and motion.

Intra-class variation: changes in the appearance of the same class under different imaging conditions.

In contrast, the inter-class variations measure the differences between images that have different class labels. For example, in the first row, our model needs to be able to differentiate between Marilyn Monroe & people who look like Marilyn Monroe.

Another example could be species of animals that look similar but are in fact from four different object classes, or 4 different breeds.

Inter-class variation

If we want to build a robust model for this image classification task, the goal is to maximize inter-class differences and minimize intra-class differences. That is, images in the same class should be as similar as possible, and images in different classes as different as possible.
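This goal can be made concrete with a tiny numerical example. Below, two classes are represented by made-up 2-D feature vectors (the numbers are assumptions for illustration); a good feature representation keeps the average distance within a class smaller than the average distance between classes:

```python
import numpy as np

# Hypothetical 2-D feature vectors for two classes.
class_a = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0]])
class_b = np.array([[5.0, 5.2], [4.8, 5.1], [5.1, 4.9]])

def mean_pairwise_dist(x, y):
    # Average Euclidean distance between every point in x and every point in y.
    diffs = x[:, None, :] - y[None, :, :]
    return np.linalg.norm(diffs, axis=-1).mean()

intra = mean_pairwise_dist(class_a, class_a)  # within-class spread
inter = mean_pairwise_dist(class_a, class_b)  # between-class separation

print(intra < inter)  # True: points within a class sit closer together
```

A model whose learned features produce this kind of geometry will be invariant to intra-class variation while remaining sensitive to inter-class variation.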

We need to be sensitive to variations between classes (inter-class) and invariant to variations within a single class (intra-class). Due to the incredible irregularities in image data, detecting these features is extremely difficult in practice; defining and manually extracting them can be extremely problematic. So how do we tackle this?

One way is to extract these visual features and detect their presence in an image simultaneously, in a hierarchical fashion, and for that we can use neural networks.

Can we learn a hierarchy of features directly from data instead of hand engineering?

Our approach here is to learn the visual features directly from data, and to learn a hierarchy of these features so that we can build up a representation of what makes up a final class label, i.e. which features (lines, curves, etc.) make up a nose or eyes.

Now that we have a foundation of how images work, we can move on to asking ourselves: how can we learn visual features with neural networks?

In the next article, we will see how neural networks allow us to learn those features directly from visual data using convolution, and we will understand what role convolution plays in Convolutional Neural Networks.

Happy Learning.(^_^)
