12. Introduction to Deep Learning with Computer Vision — Receptive fields


Written by Praveen Kumar & Nilesh Singh.

Receptive fields are one of the core concepts in CNN architectures. Over the years, many architectures have deployed numerous techniques to improve accuracy and decrease overall error. However, most of them have one thing in common: clever ways of manipulating receptive fields in order to boost accuracy.

As a beginner’s definition, the receptive field simply means the area of the input image that the network architecture is able to see. Let’s intuitively refine this definition.

Fig 1: Receptive field

In the above image, the eye can see a circular area of the visual field. Here, the visual field is nothing but our main input image, and the area we can see is called the receptive field. The eye, on the other hand, depicts a layer of the CNN architecture. The last layer (the one before the output) can see either a part of the input image or the whole image, depending on how far it is from the input. The deeper the layer, the larger its receptive field.
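To make “the deeper the layer, the larger its receptive field” concrete, here is a minimal sketch in plain Python (the helper receptive_field is our own, not from any library) that applies the standard recurrence r_out = r_in + (kernel - 1) * jump, where the jump is the cumulative stride:

    # Minimal sketch of the standard receptive-field recurrence:
    #   r_out = r_in + (kernel - 1) * jump,   jump_out = jump_in * stride
    def receptive_field(layers):
        """layers: list of (kernel_size, stride) pairs, ordered from input to output."""
        r, jump = 1, 1               # a single input pixel sees only itself
        for kernel, stride in layers:
            r += (kernel - 1) * jump
            jump *= stride
        return r

    # Three 3x3 convs already see a 7x7 patch; add a 2x2 max pool (stride 2)
    # and two more 3x3 convs and the receptive field jumps to 16x16.
    print(receptive_field([(3, 1), (3, 1), (3, 1)]))                          # -> 7
    print(receptive_field([(3, 1), (3, 1), (3, 1), (2, 2), (3, 1), (3, 1)]))  # -> 16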

We will go through a few architectures and understand how their perception of the receptive field was wrong until late 2018, and later on we will dive deeper into the receptive field and how to calculate it.

Why is the receptive field so important?

Let’s look at 3 images that show how object size can differ relative to the input image, and see why it matters so much.

Fig 2: Example Image 1

In Fig 2, the dog is much smaller than the image. Hence, to recognize that a dog is present in this image, we need only a small receptive field, because even looking at a small region of the image we obtain all the features we require to detect a dog.

Let’s look at our second example image.

Fig 3: Example Image 2

In the above image, we see that the dog is roughly the same size as the image. In this case, our network needs to look at the whole image to be able to collect all the features and then detect whether the dog is present or not. If we keep a smaller receptive field here, we will be able to deduce fewer features and thus our model will be less sure about the detection. Hence, in this case, we need a receptive field as large as the image.

Let’s look at our third image now.

Fig 4: Example Image 3

In the above image, the dog is not completely visible. Now, how large should our receptive field be? While you think about it, we tried our best to draw the following image… but don’t stop thinking about the size of the receptive field ;).

Fig 5: Custom Dog Drawing (Xd)

Well, if you think it should surely be larger than the image size, you are right. Let’s gain more understanding of this.

If we use a model whose receptive field is the same as the size of the image in Fig 5, it will be able to see parts of the dog’s face and nothing more. So it will collect fewer features and then try to detect the dog. Hence, the confidence score will be lower in this case compared to the previous 2 images.

Question: Why did we draw the dog in Fig 5?

It is to show that when the size of the object is larger than the size of the image, we need a larger receptive field to be able to see all of the features that we need from our input.

NOTE: It is important to understand that we have not spoken about the location of the object in the image. The location of the object does not matter as long as we have a receptive field large enough to cover it. However, before 2017, the well-known papers (ResNet, Inception & Xception, VGG, AlexNet, & YOLO) did not consider the receptive field as the major factor and instead spoke about object location, which is why Inception used grouped convolutions with different kernel sizes. We shall see the drawbacks of these architectures as well.

Let’s analyze a few architectures to get a deeper understanding of why the receptive field matters even more in deeper architectures.

Where did it start?

Fig 6: ILSVRC [Source]

The above image shows the journey through different kinds of architectures. When AlexNet came, it marked the start of deep neural networks; before 2012, it was all classical artificial neural networks. One could equate AlexNet to the discovery of fire in the early Stone Age: simply revolutionary.

Let’s look at a few architectures and understand what was wrong with them. We will discuss the flawed intuition behind these architectures at the time their research papers were published.

Fig 7: VGGNet

This is a plain network with layers stacked one on top of the other and a few max-pool layers in between. One of the main problems with this kind of architecture is that it cannot retain information all the way to the last layer. When we have multiple simple layers like this, we might feel it is good because the receptive field is larger, but if the object in the image is very small, would you need an architecture such as this? No. Also, when we keep stacking up layers, we are asking the model to pass all the features on to the last layer. This is difficult because we do not have any special connections between layers; all we have is a simple in-out connection between consecutive layers, and at each layer, features are extracted based only on the features received from the previous layer. So, by the time the features reach the last layer, we have lost a lot of information. This is the main issue with simple plain architectures. To summarize, we have 2 main issues with such networks (a small sketch of such a plain stack follows the list below).

  1. We cannot train the model well if we have variable object sizes in an image. (The receptive field we need cannot be known before building the architecture.)
  2. No special connections. So we lose a lot of information and cannot go deeper while thinking only about the receptive field.
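For reference, here is what such a plain stack looks like as a PyTorch-style sketch (our own toy example, not the official VGG code): every layer only receives the output of the layer directly before it, so there is no separate path for early features to reach the classifier.

    # Toy "plain" network: simple in-out connections only, no skip/special connections.
    import torch
    import torch.nn as nn

    plain_net = nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),                                  # receptive field keeps growing...
        nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(128, 128, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(128, 10),     # ...but every feature must survive the whole in-out chain
    )

    x = torch.randn(1, 3, 224, 224)
    print(plain_net(x).shape)   # torch.Size([1, 10])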
  • Inception V1 [Paper]

This paper was presented by Google at the ILSVRC challenge in 2014, when the intuitions around deep neural networks were not as well developed as they are today. They pointed out 3 important issues with the CNNs of that time.

  1. Huge variation in the location of information, so choosing the right kernel size is difficult.
  2. Very deep networks are prone to overfitting. It is also hard to pass gradient updates through the whole network.
  3. Simply stacked architectures are hugely computationally expensive.

Based on what we know now, points 2 & 3 are true, as we have discussed above. However, point 1 claims that because the location of information varies within the image, the kernel size cannot be fixed. This is the wrong line of intuition: the location of an object does not matter to a convolutional kernel. It will fire whenever it sees its pattern in the image, no matter the location.
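Here is a tiny sketch (PyTorch, our own toy example, nothing from the Inception paper) that shows this: the same kernel slides over the whole image, so its peak response is identical wherever the “object” is placed.

    import torch
    import torch.nn.functional as F

    kernel = torch.ones(1, 1, 3, 3)                    # crude detector for a 3x3 bright blob

    def peak_response(row, col):
        img = torch.zeros(1, 1, 16, 16)
        img[0, 0, row:row + 3, col:col + 3] = 1.0      # place the "object" anywhere
        return F.conv2d(img, kernel).max().item()

    print(peak_response(2, 2), peak_response(10, 5))   # 9.0 9.0 -> same response, any location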

So, to solve point 1, they introduced multi-sized kernels, and to solve point 2, they went wider rather than deeper. Let’s look at the building blocks of the Inception V1 architecture.

Fig 8: Naive Inception [Source]

To handle different object sizes, they introduced multiple kernel sizes within the same block. However, the number of parameters in this naive block was too large. Hence, they introduced 1x1 kernels to cut the number of parameters and make the architecture less computationally expensive; 1x1 convolutions are very effective at reducing parameters, as we have discussed previously. The updated architecture block is shown below, followed by a quick parameter comparison.

Fig 9: Reduced parameters based Inception v1 block [Source]
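To see how much the 1x1 “reduce” layer saves, here is a quick PyTorch sketch of just the 5x5 branch (the channel sizes are illustrative, roughly those of an early Inception block, not exact figures from the paper):

    import torch.nn as nn

    def n_params(module):
        return sum(p.numel() for p in module.parameters())

    # Naive branch: a 5x5 conv applied directly to a 192-channel input
    naive_branch = nn.Conv2d(192, 32, kernel_size=5, padding=2)

    # Reduced branch: a 1x1 conv shrinks 192 channels to 16 before the 5x5 conv
    reduced_branch = nn.Sequential(
        nn.Conv2d(192, 16, kernel_size=1),
        nn.Conv2d(16, 32, kernel_size=5, padding=2),
    )

    print(n_params(naive_branch))     # 153,632  (192*5*5*32 weights + 32 biases)
    print(n_params(reduced_branch))   # 15,920   -> roughly 10x fewer parameters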

Resources:

- These types of architectures belong to group convolution.

- Inception V1 parameter calculation and a deeper understanding of the whole architecture.

Let's try to visualize the receptive field for this architecture.

Fig 10: Receptive field visualization for Inception v1

The black outline represents the receptive field formed by the 5x5 kernel, and red represents the receptive field formed by the 3x3 kernel. As we can see, the model is now more flexible with respect to differently sized objects. This is why, if we observe the Inception V1 architecture, there are several output connections in the middle of the architecture. These outputs are taken because an object may be found within a smaller receptive field (after a few initial layers), while objects almost the size of the image may only be found at a later stage of the model.

Now, if we consider other architectures such as ResNet, ResNeXt, and SENet, they also play with the receptive field, knowingly or unknowingly. However, the concept of the receptive field has never been clearly called out in any of these research papers. We wanted to point out that it is one of the key factors deciding how well a model does.

Let’s come back to the receptive field. Now that we understand how important it is, let’s look at the human eye’s receptive field.

Fig 11: Human eye receptive field [Source]

The human visual pathway has only 4 layers [Source], and they are enough for us to detect anything and find any kind of pattern when learning images. However, we have designed CNNs with 1000+ layers, and it is still difficult for us to find the right receptive-field-based architecture.

Fig 11 shows 3 different receptive fields in our visual system. A few things to observe here:

  1. Our receptive fields are not square boxes; they are circular, unlike those in CNNs.
  2. V1 captures fine-grained data (gradients, textures), V3 captures parts of objects (patterns & object parts), and hV4 captures the whole scene (the object itself). Note that this is what we already covered in article 5; our brain follows the same ideas, just in a different way.

NOTE: We are starting a new Telegram group to tackle all the questions and any sort of queries. You can openly discuss concepts with other participants and get more insights, which will be more helpful as we move further down the publication. [Follow this LINK to join]
