Understanding YOLOv7 Neural Network
A bit more detailed …
Note: This is a living document. Expect it to get updated as I dig more.
1. Introduction
YOLOv7 is one of the models in the YOLO (You Only Look Once) series of object detectors. Many articles on the web discuss the YOLOv7 architecture, but none of them is comprehensive enough to describe the architectural components end to end. The purpose of this post is to serve as an end-to-end guide to understanding the YOLOv7 neural network.
A YOLO network consists of three main components, as shown in Figure 1:
- Backbone: A convolutional neural network that creates image features (a.k.a. embeddings).
- Neck: A collection of neural network layers that combine and mix features and pass them to the next stage for prediction.
- Head: Consumes features from the neck and creates the prediction outputs.
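To make the data flow between these three components concrete, here is a minimal shape-only sketch in plain Python. It does no real computation; it only tracks the tensor shapes that pass from backbone to neck to head. The specific numbers are assumptions for illustration and are not taken from this post: a 640x640 input, three feature scales at strides 8, 16 and 32, 3 anchors per scale, 80 classes, and placeholder channel counts. The names `FeatureMap`, `backbone`, `neck`, `head` and `detect_shapes` are hypothetical helpers, not part of any YOLOv7 codebase.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FeatureMap:
    """Shape of one feature map (no actual tensor data)."""
    channels: int
    height: int
    width: int


def backbone(image_hw: Tuple[int, int],
             strides=(8, 16, 32),
             channels=(256, 512, 1024)) -> List[FeatureMap]:
    # Assumption: the backbone emits one feature map per stride.
    # Channel counts here are placeholders, not YOLOv7's real values.
    h, w = image_hw
    return [FeatureMap(c, h // s, w // s) for s, c in zip(strides, channels)]


def neck(features: List[FeatureMap], out_channels: int = 256) -> List[FeatureMap]:
    # The neck mixes information across scales. In this sketch the
    # spatial sizes stay the same and only the channel count changes.
    return [FeatureMap(out_channels, f.height, f.width) for f in features]


def head(features: List[FeatureMap],
         num_anchors: int = 3,
         num_classes: int = 80) -> List[Tuple[int, int, int, int]]:
    # Per anchor and grid cell: 4 box values + 1 objectness + class scores.
    per_anchor = 4 + 1 + num_classes
    return [(num_anchors, f.height, f.width, per_anchor) for f in features]


def detect_shapes(image_hw: Tuple[int, int] = (640, 640)):
    # Backbone -> Neck -> Head, mirroring the three components above.
    return head(neck(backbone(image_hw)))


if __name__ == "__main__":
    print(detect_shapes())
    # [(3, 80, 80, 85), (3, 40, 40, 85), (3, 20, 20, 85)]
```

Under these assumptions, each output tuple reads as (anchors, grid height, grid width, values per prediction), so the head produces one grid of predictions per feature scale.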
Specifically, the YOLOv7 architecture looks like the diagram below.
Note that the diagram in Figure 2 was created by the folks at mmlab, not by the YOLOv7 authors. Therefore, the naming of some network blocks might not exactly match…