This article is the second in the series where I thoroughly explain how the YOLOX (You Only Look Once X) model works. If you are interested in the code, you can find a link to it below:
GitHub - gmongaras/YOLOX_From_Scratch
Attempting to build the YOLOX algorithm from scratch.
This series has 4 parts to fully go over the YOLOX algorithm:
- What is YOLO and What Makes It Special?
- How Does YOLOX Work? (self)
- SimOTA For Dynamic Label Assignment
- Mosaic and Mixup For Data Augmentation
Darknet-53 — The YOLOX Backbone
The YOLOv3 algorithm is the basis for many object detection algorithms and is also what YOLOX builds on. Before going into YOLOv3, I am assuming you know how YOLOv1 works, which was briefly explained in the last article.
The YOLOv3 algorithm is very similar to the original YOLO algorithm, but it makes some small changes that make a significant impact.
A major change YOLOv3 makes is that it uses a large backbone called Darknet-53. (A backbone is a general, unspecialized network that the data is sent through first.) The architecture of the backbone uses 1×1 convolutions, residual connections, and 3×3 convolutions to make a very powerful feature extractor. This backbone is also what YOLOX uses, and it has the following architecture.
The first step to get a prediction from YOLOv3 is to send an image through this backbone, which encodes the data so the YOLOv3 head can make final predictions. (Unlike a backbone, a head is a specialized architecture that is used to make predictions.)
The head of the YOLOv3 model is basically the same as the YOLOv1 model. The difference between the two doesn’t matter in our case since YOLOX completely changes the head of the model. Remember that the YOLOv1 model’s final predictions were basically a massive 3-dimensional tensor where the length and width were different predictions and the depth was different features of a prediction.
One question I had when reading the YOLOv3 paper was how to get features from the Darknet backbone, since its output goes through a softmax layer. The answer is that the model uses something called a Feature Pyramid Network (FPN). A feature pyramid network extracts information from an image at several spatial resolutions (different widths and heights). To do this with Darknet, we take intermediate (transition) outputs from the model and use those as several outputs instead of a single output coming from the end of the network. Below is a diagram of how this works.
Essentially, the Darknet-53 backbone (which I may call the FPN from now on) outputs three different feature maps at different scales:
- 256 channels (first transition output)
- 512 channels (second transition output)
- 1024 channels (third transition output)
Each of these outputs extracts information at a different scale. Notice how, as the number of channels increases, the length and width of the output decrease. So the 256-channel transition output extracts features at a smaller scale, while the 1024-channel output extracts features at a larger scale, since the 1024-channel output has less spatial information from the original image to work with.
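To make the trade-off between channels and spatial size concrete, here is a minimal sketch. It assumes a square 256×256 input and pairs each channel count with the strides of 8, 16, and 32 that this article quotes later for the three FPN levels. The function name is only illustrative and is not taken from the YOLOX code.

```python
# Illustrative only: spatial size of each FPN output for a square input image.
# Channel counts come from the list above; the strides (8, 16, 32) are the
# ones quoted later in this article for the three FPN levels.
def fpn_output_shapes(image_size, levels=((256, 8), (512, 16), (1024, 32))):
    shapes = []
    for channels, stride in levels:
        side = image_size // stride  # each level downsamples by its stride
        shapes.append((side, side, channels))
    return shapes


for shape in fpn_output_shapes(256):
    print(shape)  # (H, W, channels) for one FPN level
```

More channels go hand in hand with a smaller height and width, which is why the 1024-channel output only sees the image on a coarse 8×8 grid in this example.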
YOLOv3 head vs. YOLOX head
Although the YOLOv3 backbone and the YOLOX backbone are the same, the models begin to differ from their heads. Below is an image showing the difference between the two heads.
When I first saw this diagram, I found it a bit confusing, so below is my marked-up version of the diagram which I hope makes it a little easier to understand.
This diagram is stating that the input into both the YOLOv3 head and the YOLOX head is the 3 outputs from the FPN (darknet) backbone at three different scales — 1024, 512, 256 channels.
The output of the two heads is essentially the exact same with dimensions (H×W×features) which is just like the original YOLO. The difference between the two heads is that YOLOv3 uses a coupled head and YOLOX uses a decoupled head. So, the output of YOLOX is actually 3 tensors each holding different information instead of 1 massive tensor with all information.
The three tensors YOLOX outputs hold the same information as the massive tensor YOLOv3 outputs:
- Cls: The class of each bounding box
- Reg: The 4 parts to the bounding box (x, y, w, h)
- IoU (Obj): For some reason, the authors use IoU instead of Obj, but this output is just how confident the network is that there’s an object in the bounding box (objectness)
Just like with the original output, each “pixel” in the height and width of the output is a different bounding box prediction. So, there are H*W different predictions.
The outputs listed above are only for a single output in the FPN. Remember there are three outputs from the FPN which are fed into the heads of YOLOv3 and YOLOX. This means there are actually three different outputs from each of the heads instead of 1. So, the output of YOLOv3 is actually (3×H×W×features), with H and W differing at each scale, and the output of YOLOX is actually 3 of each of the Cls, Reg, and IoU (obj) outputs, making 9 total outputs.
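The nine YOLOX outputs can be listed with a short sketch. This is not the author's code: the function name, the 256×256 input, and the 80 classes are assumptions for illustration, and the objectness tensor is assumed to hold a single value per prediction (H×W×1).

```python
# Hypothetical helper: shapes of the three decoupled-head outputs at each FPN level.
def yolox_head_output_shapes(image_size, num_classes, strides=(8, 16, 32)):
    outputs = {}
    for stride in strides:
        side = image_size // stride
        outputs[stride] = {
            "cls": (side, side, num_classes),  # class scores per prediction
            "reg": (side, side, 4),            # (x, y, w, h) per prediction
            "obj": (side, side, 1),            # objectness per prediction (assumed shape)
        }
    return outputs


shapes = yolox_head_output_shapes(256, num_classes=80)
for stride, tensors in shapes.items():
    print(stride, tensors)

# Every "pixel" of every level is one bounding box prediction.
total_predictions = sum(t["reg"][0] * t["reg"][1] for t in shapes.values())
print(total_predictions)
```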
Switching To An Anchor-free Model
One of the most important changes YOLOX made was not using anchors whereas YOLOv3 heavily relies on anchors.
What is an anchor?
An anchor is basically a predefined bounding box shape that helps the network. Instead of predicting the direct bounding box, previous YOLO algorithms predicted an offset from a predefined anchor box. Imagine an anchor box had a length and width of 100 and 50 while the model predicted a length and width of 10 and 15. The final bounding box prediction would be an offset from the anchor box with a length and width of 110 and 65. More information about anchor boxes can be found in this conversation.
Basically, an anchor box is a way to help the model so it doesn’t have to directly predict a bounding box.
The Problem With Anchor Boxes
Anchor boxes are basically extra parameters. How many anchors should the model use? What should the sizes of the anchors be? These questions lead to more hyperparameter tuning and less diversity in the model.
How Does YOLOX Fix The Anchor Box Problem?
YOLOX simply has the model directly predict the bounding box dimensions as opposed to predicting an offset from an anchor box. To directly predict a bounding box, YOLOX uses a decoupled head which was explained above. Additionally, it uses something called striding.
YOLOX is not based only on YOLOv3. It is also based on FCOS, which is another object detection model, but it’s not part of the YOLO series, making it not very cool.
FCOS uses striding to help the model out. Imagine the model has to learn to predict a bounding box anywhere from the top left of an image at 0,0 to the bottom right of an image at 1024,1024. In a discrete space, the model has 1,048,576 (1024×1024) possible locations to predict from, and it will likely struggle to learn anything with such a wide range of predictions.
Striding fixes this issue and allows the model to predict from an offset as opposed to from the top-left of the image. Basically, we can split the image up into a grid based on the three different scales the model will make predictions at. For example, the grids may look something like the following:
Using these grids, we can assign each prediction to each of the intersection points on the grid. The nice part about the predictions from the YOLOX model is they’re already in a length×width format. So we can directly map each of the outputs to a unique point on the grid and then use that grid point as an offset to scale the bounding box.
The grids above can be created by defining a certain stride, which is the distance between each of the intersection points on the grid. In the YOLOX algorithm, strides of 32, 16, and 8 are used for each FPN level respectively. If a stride of 32 is used on a 256×256 image, then there will be a total of 256/32 = 8 intersection points on each dimension, totaling 64 intersection points.
For example, I am going to use the YOLOX FPN strides defined above to put a grid on the following image:
The following image has the grid overlay for the bear image:
Each intersection point on the image is called an anchor point. Don’t get this confused with an anchor explained earlier since this type of anchor is slightly different. An anchor point is an offset to move the x,y location of a prediction while the anchor explained previously (which is what YOLOX gets rid of) is a predefined box that is used as an offset for the w, h parts of a prediction. Anchor boxes are bad as they are extra hyperparameters to tune while anchor points are fine since they don’t involve extra parameters we have to tune.
Note: From now on, when I say anchor, I am referencing the location on the grid that YOLOX uses, not the predefined bounding box that YOLOv3 uses.
The anchor location on the image can be obtained with the following formulas:
x = s/2 + s*i
y = s/2 + s*j
Where s is the stride, i is the ith intersection point on the x-axis, and j is the jth intersection point on the y-axis.
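The two formulas above can be turned into a small sketch that lists every anchor point for one stride. The function name and the row-by-row ordering are my own choices for illustration, and the image is assumed to be square.

```python
# Minimal sketch: pixel location of every anchor point for one stride.
def anchor_points(image_size, stride):
    """Return (x, y) for every grid point, ordered row by row."""
    n = image_size // stride  # grid points along each dimension
    return [
        (stride / 2 + stride * i, stride / 2 + stride * j)
        for j in range(n)   # j indexes the y-axis
        for i in range(n)   # i indexes the x-axis
    ]


points = anchor_points(256, 32)
print(len(points))        # 8 * 8 grid points for a 256x256 image
print(points[1 * 8 + 2])  # the point at (i, j) = (2, 1)
```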
For YOLOX, we use the grid points as top-left offsets of the bounding box. The following formulas are used to map a predicted bounding box (p_x, p_y, p_w, p_h) to the actual location on the image (l_x, l_y, l_w, l_h) if (x, y) is the intersection point on the grid which the prediction belongs to and s is the stride at the current FPN level:
l_x = p_x + x
l_y = p_y + y
l_w = s*e^(p_w)
l_h = s*e^(p_h)
We move the predicted point by adding the prediction to the anchor (the x,y point assigned to this prediction). We also denormalize the width and height: the exponential function ensures they are not negative, and multiplying by the stride scales them to the current FPN level.
For example, let’s go back to the bear image with a stride of 32. If the anchor point for this prediction was (i, j) = (2, 1), meaning intersection point 2 on the x-axis and 1 on the y-axis, I would be looking at the following point on the image:
Note: The point is at (2, 1) on the grid, but pixel-wise, it is at:
x = 32/2 + 32*2 = 16 + 64 = 80
y = 32/2 + 32*1 = 16 + 32 = 48
If the model gave me the prediction of (20, 15, 0.2, 0.3), then we can calculate the box as:
l_x = 20 + 80 = 100
l_y = 15 + 48 = 63
l_w = 32*e^(0.2) ≈ 39
l_h = 32*e^(0.3) ≈ 43
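This calculation can be reproduced with a few lines of Python. It is a sketch of the four mapping formulas for a single prediction, and `decode_prediction` is a name I made up for illustration.

```python
import math


def decode_prediction(pred, anchor, stride):
    """Map a raw prediction (p_x, p_y, p_w, p_h) onto the image."""
    p_x, p_y, p_w, p_h = pred
    x, y = anchor
    l_x = p_x + x                 # shift by the anchor point
    l_y = p_y + y
    l_w = stride * math.exp(p_w)  # exp keeps the size positive
    l_h = stride * math.exp(p_h)
    return l_x, l_y, l_w, l_h


stride = 32
i, j = 2, 1
anchor = (stride / 2 + stride * i, stride / 2 + stride * j)  # (80.0, 48.0)
box = decode_prediction((20, 15, 0.2, 0.3), anchor, stride)
print([round(v) for v in box])  # [100, 63, 39, 43]
```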
So, the final image may look like the following:
Not all predictions are equal. Some are clearly garbage and we don’t even want our model to optimize them. To differentiate between good and bad predictions, YOLOX uses something called SimOTA which is used for dynamic label assignment.
SimOTA will be explained in the next article, but for now, all you need to know is predictions that are thought of as “good” (which bound a ground truth object) are labeled as positive and those that are “bad” (which bound the background) are labeled as negative.
The negative labeled predictions are not just thrown away as one of the loss functions uses them.
The positive labels on the other hand are really important. How do we know which ground truth bounding box we want each anchor to optimize? (Note: ground truth means the bounding box we want the model to predict) SimOTA doesn’t just assign positive/negative labels, but it also assigns the ground truth bounding boxes to each positively labeled anchor in the image. This ground truth bounding box is very important for optimizing the model.
Loss Functions — Evaluating YOLOX
There are three outputs to the YOLOX model and each output has its own loss function as they need to be optimized in different ways.
The class output of the YOLOX model has the following shape: H×W×C. So for each prediction, the model predicts a vector of C elements.
C is the number of classes that can be chosen from. So each element represents the probability of that class, or how confident the model is that the class is the one in the bounding box.
To optimize this, we can use a one-hot encoded vector which encodes the class of the ground truth bounding box for each anchor/prediction. The one-hot vector has C elements for each prediction. The 1 in the one-hot vector goes in the location of the class we want the model to predict while a 0 is placed in all other locations. For example, if I had four classes and I wanted the model to predict the second, the vectors might look like the following:
pred1: [0.45, 0.25, 0.05, 0.25] # The model is most confident in the 1st class
pred2: [0.25, 0.25, 0.25, 0.25] # The model is not confident in any class
pred3: [0.1, 0.7, 0.1, 0.1] # The model is very confident in the second class
labels: [0, 1, 0, 0] # 1 in the second location which is what we want the model to predict
To optimize these predictions, we can put both the predictions (of shape H×W×C) and the ground truth labels (also of shape H×W×C) through a Binary Cross Entropy (BCE) loss function. Specifically, we put all positive predictions through BCE with logits which is a fancy way of saying throw the predictions through a sigmoid and then put it through the BCE function.
Note: The negative labeled predictions are not used in this loss
The reason a one-hot encoded vector is used as opposed to only using the correct class is to help the model learn that the correct class should have a weight of 1 while the others should have a weight of 0. Notice how the class output of the model has the same number of dimensions as the one-hot vector. The model is not predicting a single value, rather it is predicting a distribution of all possible classes. So, we shouldn’t optimize for a single value, instead, we want to optimize for all values it predicts.
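As a rough illustration of BCE with logits against a one-hot label, here is a minimal pure-Python sketch for a single prediction. The logit values are made up, averaging over the C elements is an assumption of this sketch, and the formula is written naively for readability. Deep learning frameworks typically ship a numerically stable version.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_with_logits(logits, targets):
    """Sigmoid followed by binary cross entropy, averaged over the elements."""
    total = 0.0
    for z, t in zip(logits, targets):
        p = sigmoid(z)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(logits)


labels = [0, 1, 0, 0]                      # one-hot: we want the second class
confident_right = [-3.0, 3.0, -3.0, -3.0]  # raw logits, not probabilities
unsure = [0.0, 0.0, 0.0, 0.0]
confident_wrong = [3.0, -3.0, -3.0, -3.0]

for name, logits in [("right", confident_right), ("unsure", unsure), ("wrong", confident_wrong)]:
    print(name, round(bce_with_logits(logits, labels), 4))
```

The loss is smallest when the model is confident in the correct class and largest when it is confident in a wrong one. Every element of the vector contributes, not just the correct class.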
Optimizing the regression (bounding box prediction) outputs is a little more tricky than the class outputs. Remember that the shape of the regression output is H×W×4 where each prediction is (x, y, w, h).
One might think that Mean Squared Error (MSE) is a good evaluation metric since this is a regression task. YOLOv3 actually uses a similar metric called Sum of Squared Error (SSE). The problem with these metrics is that they cause the model to overfit to the regression targets in the training sample.
Intersection Over Union (IOU)
To overcome this problem, YOLOX uses an evaluation metric called IoU.
IoU is calculated by taking a predicted bounding box and comparing it to its ground truth bounding box. IoU first calculates the intersection between the two bounding boxes, then the union. The final result is the intersection between the boxes divided by the union of the boxes.
Let’s take a closer look into IoU. First off, the intersection can never be greater than the union and the smallest intersection is 0 giving us the following constraint:
0 ≤ I ≤ U
When there is no intersection, the union will be the area of both boxes combined (A₁+A₂) and when the boxes have a 100% intersection, the union will be the area of one of the boxes (A₁) giving us the following constraint:
A₁ ≤ U ≤ A₁+A₂
So, as the intersection grows, the IoU goes toward 1 because the intersection and union converge to the same value. As the intersection decreases, the IoU goes toward 0 because the union grows, making the numerator smaller and the denominator larger. So, IoU has the following values:
0 ≤ IoU ≤ 1
We actually want to maximize the IoU value since we want the intersection to contain the entirety of both boxes. The problem is that gradient descent minimizes the loss, so the IoU loss is taken as 1-IoU, giving us the same range of values, but flipped. So the loss has a higher value when the intersection is closer to 0% and a lower value when the intersection is closer to 100%.
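Here is a minimal sketch of IoU and the 1-IoU loss for two boxes. It assumes boxes are given as (x, y, w, h) with (x, y) being the top-left corner, and the function names are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x, y, w, h) boxes, (x, y) = top-left corner."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the overlapping region (0 if the boxes do not overlap).
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


def iou_loss(pred_box, gt_box):
    return 1.0 - iou(pred_box, gt_box)


print(iou((0, 0, 10, 10), (0, 0, 10, 10)))    # identical boxes
print(iou((0, 0, 10, 10), (5, 0, 10, 10)))    # half of each box overlaps
print(iou((0, 0, 10, 10), (20, 20, 5, 5)))    # no overlap
```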
Evaluating The Regression Output
To evaluate the regression output, we actually use Generalized Intersection over Union (GIoU). GIoU is similar to IoU except it has values between -1 and 1. The problem with IoU (the value, not the loss) is that boxes with an IoU of 0 don’t carry any extra information: two boxes that barely miss each other and two boxes on opposite sides of the image both get an IoU of 0. GIoU fixes that. The idea is the same except it encodes some more information and gives a smoother function that has a non-zero value when IoU would be 0.
To optimize the model, we minimize the GIoU loss directly by taking the sum over all positive predictions and minimizing that sum.
Note: The negative labeled predictions are not used in this loss
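The sketch below shows one way to compute GIoU and the summed loss. It uses the standard definition, GIoU = IoU - (C - U)/C, where C is the area of the smallest box enclosing both boxes and U is the union. Boxes are assumed to be (x, y, w, h) with a top-left (x, y) and non-zero area, and taking the loss as 1 - GIoU (mirroring the 1-IoU loss above) is my assumption.

```python
def giou(box_a, box_b):
    """Generalized IoU of two (x, y, w, h) boxes, (x, y) = top-left corner."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    # Smallest axis-aligned box that encloses both boxes.
    enclose_w = max(ax + aw, bx + bw) - min(ax, bx)
    enclose_h = max(ay + ah, by + bh) - min(ay, by)
    enclose = enclose_w * enclose_h
    return intersection / union - (enclose - union) / enclose


def giou_loss(pred_boxes, gt_boxes):
    """Sum of (1 - GIoU) over positive predictions and their assigned ground truths."""
    return sum(1.0 - giou(p, g) for p, g in zip(pred_boxes, gt_boxes))


gt = (0, 0, 10, 10)
print(giou(gt, (20, 0, 10, 10)))  # IoU is 0, but the box is close
print(giou(gt, (90, 0, 10, 10)))  # IoU is 0, and the box is far away
```

Both boxes have an IoU of 0 with the ground truth, but GIoU is less negative for the closer one, so the loss can still tell the model which direction to move.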
We want the model to have an objectness score near 1 if it thinks there’s an object in the box, a score of around 0 if it doesn’t think there’s something in the box, and anywhere in between (hopefully around 0.5) if it’s uncertain.
So, we want a function that is between 0 and 1, that is closer to 1 when the bounding box covers an object perfectly, and that is 0 when the bounding box does not cover the object at all. A perfect function for this is IoU. Specifically, we want to use IoU, not GIoU, since GIoU has a range of 2 while IoU has a range of 1.
Similar to the class loss function, we will use BCE to optimize the objectness predictions. To optimize a single prediction, there are two possibilities:
- The easy predictions to consider are those that are labeled as positive.
- The second set of predictions we must consider are those labeled as negative.
Since SimOTA assigns the ground truth bounding box to the positive predictions, we can take the IoU between the predicted bounding box and the ground truth bounding box to get the value we want the model to predict. Then, we can put the predicted objectness and IoU value in the BCE loss function to get the loss for this prediction.
The problem with the negative predictions is that we want the objectness loss to optimize bad predictions so that it learns what a bad prediction looks like vs. what a good prediction looks like. SimOTA doesn’t assign a bounding box to the negative labeled predictions. So how do we get the ground truth bounding box for the negative predictions?
One way to assign the objectness labels for each negative prediction is to assign a value of 0 to all negative predictions. A problem with this strategy is that there are better negative labels and worse negative labels. Not all negative labels are equally bad.
A better way to get the ground truth for negative predictions is by looking at all ground truth bounding boxes in an image. We can compute the IoU between the predicted negatively labeled bounding box and all ground truths. Then, we take the largest IoU value (meaning the predicted bounding box covers that ground truth more than all other ground truths) and assign this IoU value to the predicted bounding box. Then we can take the BCE with logits between the predicted objectness and that assigned IoU value to get the loss for that negative label.
Note: All predictions, including the negative labeled ones, are used in this loss.
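The target-building step described above can be sketched as follows. Here `assigned_gt` is a hypothetical stand-in for the SimOTA result: it holds the index of the assigned ground truth for a positive prediction and None for a negative one. Boxes are assumed to be (x, y, w, h) with a top-left (x, y).

```python
def iou(box_a, box_b):
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


def objectness_targets(pred_boxes, gt_boxes, assigned_gt):
    """Build the IoU value each objectness prediction should learn to output."""
    targets = []
    for box, gt_idx in zip(pred_boxes, assigned_gt):
        if gt_idx is not None:
            # Positive: IoU with the ground truth assigned by label assignment.
            targets.append(iou(box, gt_boxes[gt_idx]))
        else:
            # Negative: largest IoU with any ground truth in the image.
            targets.append(max((iou(box, gt) for gt in gt_boxes), default=0.0))
    return targets


gt_boxes = [(0, 0, 10, 10), (50, 50, 10, 10)]
pred_boxes = [(0, 0, 10, 10), (5, 0, 10, 10), (100, 100, 5, 5)]
assigned_gt = [0, None, None]  # hypothetical SimOTA output
targets = objectness_targets(pred_boxes, gt_boxes, assigned_gt)
print(targets)
```

These targets would then be paired with the predicted objectness values in BCE with logits.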
Final Loss Function
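The final loss function is a combination of the three losses stated above and is defined as follows:
loss = (cls_loss + reg_weight*reg_loss + obj_loss) / N_pos
Where N_pos is the number of positively labeled predictions.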
The loss function is basically the sum of all losses averaged over the number of positive labels. Remember, we used SimOTA to assign labels to each prediction.
reg_weight is a balancing term used to weigh the regression loss over the other losses as it’s most important to optimize. The authors use a weight of 5.0.
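Putting that description into code gives something like the sketch below. The function and argument names are my own, each of the three inputs is assumed to be a loss already summed over its predictions, and the guard against zero positives is an addition of this sketch.

```python
def total_loss(cls_loss, reg_loss, obj_loss, num_positive, reg_weight=5.0):
    """Combine the three summed losses and average over the positive labels."""
    num_positive = max(num_positive, 1)  # assumption: avoid dividing by zero
    return (cls_loss + reg_weight * reg_loss + obj_loss) / num_positive


print(total_loss(cls_loss=2.0, reg_loss=1.0, obj_loss=3.0, num_positive=2))  # 5.0
```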
The YOLOX model makes inferences like most other machine learning models, but it has a major problem that needs to be dealt with. As usual, to make an inference, we first send the data through the model.
Before doing anything with the outputs, remember we use BCE with logits, not BCE to optimize the class and objectness predictions. BCE with logits is optimizing the sigmoid of the outputs, not the default output. So, the first thing to do to get the data in the correct form is to take the sigmoid of both the class and objectness values. Also, remember that the model predicts a distribution for each class prediction, not the class value. So, we want to take the argmax of each class prediction to get the final prediction for the class:
final_cls = argmax(sigmoid(cls), axis=-1)
final_obj = sigmoid(obj)
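For a single prediction, those two lines can be sketched in plain Python as follows. The logit values are made up, and `postprocess` is only an illustrative name.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def postprocess(cls_logits, obj_logit):
    """Turn the raw class and objectness outputs of one prediction into final values."""
    cls_probs = [sigmoid(z) for z in cls_logits]
    final_cls = max(range(len(cls_probs)), key=cls_probs.__getitem__)  # argmax
    final_obj = sigmoid(obj_logit)
    return final_cls, final_obj


print(postprocess([-1.0, 2.0, 0.5, -3.0], 0.0))  # (1, 0.5)
```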
As for the regression outputs, we have to move them to their correct location on the image using the striding formulas defined earlier in this article.
A problem that we face when making inferences is that YOLOX outputs a lot of bounding boxes and most of these bounding boxes aren’t good predictions. To handle this issue, the model’s output goes through two pruning steps:
- Remove all outputs with a confidence score (objectness) under a certain threshold. When I coded the YOLOX model, I removed all predictions with a confidence score below 0.5.
- Use Soft Nonmax Suppression to prune the predictions even further.
After putting the predictions through these two steps, a small number of predictions should be left which are then the final predictions for the model.
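The first pruning step is a simple filter. The sketch below assumes the predictions for one image are stored as three parallel lists, and the function name is hypothetical.

```python
def prune_by_objectness(boxes, scores, classes, threshold=0.5):
    """Keep only predictions whose objectness score is at least `threshold`."""
    kept = [(b, s, c) for b, s, c in zip(boxes, scores, classes) if s >= threshold]
    if not kept:
        return [], [], []
    kept_boxes, kept_scores, kept_classes = (list(group) for group in zip(*kept))
    return kept_boxes, kept_scores, kept_classes


boxes = [(0, 0, 10, 10), (5, 5, 10, 10), (50, 50, 20, 20)]
scores = [0.9, 0.3, 0.5]
classes = [1, 1, 2]
print(prune_by_objectness(boxes, scores, classes))
```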
Nonmax suppression is a very good way of pruning bounding boxes without knowing where ground truths are in the image. To do this, the algorithm basically removes predictions with high overlap as shown in the following image:
The way nonmax suppression removes bounding boxes with a high overlap is by using the IoU score between overlapping bounding boxes. Those with a high IoU are removed so that a single bounding box is kept.
Below is the pseudocode I used to implement soft-NMS (Nonmax suppression):
B - The predicted bounding boxes with shape (x, y, w, h)
S - The confidence score for each bounding box (objectness)
C - The class for each bounding box
score_thresh - The score threshold to remove boxes
IoU_thresh - The IoU threshold to update scores

softNMS(B, S, C, score_thresh, IoU_thresh):
    D = []  <- Bounding boxes we want to keep for all images
    for img in imgs:
        b = B[img]
        s = S[img]
        d = []
        while b not empty:
            # Get the bounding box with the highest score and save it
            m = argmax(s)
            M = b[m]
            d.append(M)
            # Remove the bounding box with the highest score from the lists
            remove index m from b and s
            # Get the mean of all confidence scores
            mean_scores = mean(s)
            # Get the IoU between M and all b
            IoU = IoU_funct(M, b)
            # Update all scores, s, where the IoU > IoU_thresh
            idx = argwhere(IoU > IoU_thresh)
            s[idx] = s[idx]*e^(-(IoU[idx]**2)/mean_scores)
            # Remove the bounding boxes from b (and their scores from s) where s < score_thresh
            b = b[s >= score_thresh]
            s = s[s >= score_thresh]
        # Save the bounding boxes for this image
        D.append(d)
    return D
The following are the steps in a different format:
- Get the bounding box with the highest objectness score
- Remove the selected prediction from the lists
- Get the mean of all remaining objectness scores
- Get the IoU between M, the selected bounding box, and b, all other bounding boxes
- Update the scores of the bounding boxes with a high IoU
- Remove all bounding boxes from b where the score is less than the score threshold
- Repeat steps 1–6 until b is empty
The formula used to update the score can be found in the original soft-nonmax suppression paper (on page 4). It is one of the formulas the paper suggests can be used to create soft nonmax suppression along with several others.
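For readers who want something they can run, here is a plain-Python sketch of the steps above for a single image. It follows the pseudocode in this article, including the use of the mean remaining score in the exponent, rather than any reference implementation. Boxes are assumed to be (x, y, w, h) with a top-left (x, y), and the scores are assumed to be positive (for example, already pruned by the objectness threshold).

```python
import math


def iou(box_a, box_b):
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


def soft_nms(boxes, scores, score_thresh, iou_thresh):
    """Soft-NMS for one image; returns the kept boxes in the order they were selected."""
    boxes, scores = list(boxes), list(scores)  # work on copies
    kept = []
    while boxes:
        # Select and save the highest-scoring box, then remove it from the lists.
        m = max(range(len(scores)), key=scores.__getitem__)
        best = boxes.pop(m)
        scores.pop(m)
        kept.append(best)
        if not boxes:
            break
        mean_score = sum(scores) / len(scores)
        # Decay the score of every remaining box that overlaps the selected box too much.
        for k, box in enumerate(boxes):
            overlap = iou(best, box)
            if overlap > iou_thresh:
                scores[k] *= math.exp(-(overlap ** 2) / mean_score)
        # Drop boxes whose score fell below the threshold, keeping scores in sync.
        survivors = [(b, s) for b, s in zip(boxes, scores) if s >= score_thresh]
        boxes = [b for b, _ in survivors]
        scores = [s for _, s in survivors]
    return kept


boxes = [(0, 0, 10, 10), (1, 0, 10, 10), (50, 50, 10, 10)]
scores = [0.9, 0.8, 0.7]
print(soft_nms(boxes, scores, score_thresh=0.5, iou_thresh=0.5))
```

In this example the second box overlaps the first heavily, so its score is decayed below the threshold and it is dropped, while the far-away third box is untouched and kept.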
That’s basically all there is to YOLOX. In the next article, we will go over how SimOTA works for dynamic label assignment.