YOLO Model Compression via Filter Pruning for Efficient Inference on Raspberry Pi

Natthasit Wongsirikul
Nov 27, 2021


In a previous post, I talked about convolutional neural network (CNN) model compression via pruning. In this post, I will share my experience applying it to YOLOv2. I picked YOLOv2 because of its relatively small size (only 23 layers) compared to more recent YOLO versions. I trained and pruned both the YOLOv2 model and the tiny-YOLO model on a custom dataset, then optimized each model algorithmically, reducing its size while trying to maintain performance. Lastly, I tested the compressed models against their non-compressed counterparts on a Raspberry Pi, comparing inference time, memory consumption, and detection performance.

Dataset

The dataset is composed of 980 images overlooking the road at an intersection in Bangkok, Thailand, where cars and motorbikes drive past or stop at a red light. Images are 720 (width) by 480 (height) pixels in JPEG format. They were taken over the course of two days: 875 images on day-1 and the remaining 105 on day-2. Below are some samples of the images in the dataset.

The dataset was split into a training set, a validation set, and a testing set. All images captured on day-2 form the test set, while the images from day-1 were randomly split 9:1 into training and validation sets. This gave 788 images in the training set, 87 in the validation set, and 105 in the test set.
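A minimal sketch of that split, with hypothetical file names standing in for the real images:

```python
import random

random.seed(42)  # fixed seed so the split is reproducible

# hypothetical file names standing in for the real dataset
day1_images = [f"day1_{i:04d}.jpg" for i in range(875)]
day2_images = [f"day2_{i:04d}.jpg" for i in range(105)]

random.shuffle(day1_images)
split = round(0.9 * len(day1_images))   # 9:1 split of day-1 images
train_set = day1_images[:split]         # 788 images
val_set = day1_images[split:]           # 87 images
test_set = day2_images                  # all 105 day-2 images held out
```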

Initial Model Training

As a baseline for comparison, the models with the pre-trained weights from DarkNet were evaluated on the test set before fine-tuning. The Adam optimizer was used with the learning rate set at 0.5e-4. L2 kernel regularization with a penalty factor of 0.01 was applied to all layers to reduce overfitting. K-means clustering was run on the dataset's bounding boxes, and 3 anchor boxes were chosen to keep the model's final layer small.
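A minimal sketch of the anchor-box clustering, using scikit-learn's KMeans as a stand-in for whatever clustering implementation was actually used. Note that the canonical YOLO recipe clusters with an IoU-based distance rather than the Euclidean distance used here, and the post does not specify which was used; the random data is a placeholder for the real annotations:

```python
import numpy as np
from sklearn.cluster import KMeans

# box_sizes: (N, 2) array of annotated (width, height) pairs;
# random data here stands in for the real bounding-box annotations
box_sizes = np.abs(np.random.randn(500, 2))

# 3 clusters -> 3 anchor boxes, keeping the detection head small
kmeans = KMeans(n_clusters=3, random_state=0).fit(box_sizes)
anchors = kmeans.cluster_centers_
print(anchors)  # each row is one anchor's (width, height)
```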

A stopping condition halted training once the validation loss did not improve by at least 0.001 for 3 consecutive epochs. Below is a summary of model performance before and after fine-tuning.
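Assuming a Keras/TensorFlow training pipeline (an assumption; the post does not name the framework), this stopping condition maps directly onto the built-in EarlyStopping callback:

```python
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adam

# Halt training once val_loss fails to improve by at least 0.001
# for 3 consecutive epochs, as described above.
early_stop = EarlyStopping(monitor="val_loss", min_delta=0.001,
                           patience=3, restore_best_weights=True)

# Hypothetical training call; yolo_loss and the datasets are placeholders:
# model.compile(optimizer=Adam(learning_rate=0.5e-4), loss=yolo_loss)
# model.fit(train_data, validation_data=val_data, epochs=100,
#           callbacks=[early_stop])
```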

Filter Pruning Algorithm

The objective of this algorithm was to reduce the size of the YOLOv2 model by removing filters that were not important to the overall performance of the network. This was done iteratively until the validation loss rose past the "post-prune-threshold", after which retraining was done to bring the validation loss back below the "post-retrain-threshold". The post-prune-threshold is the ratio of the newly pruned model's validation loss to the validation loss of the full model; if the pruned model's loss exceeded this threshold, retraining was initiated. The post-retrain-threshold is the ratio of the recovered validation loss to the validation loss of the full model. The algorithm stopped when the pruned network's validation loss exceeded the post-prune-threshold and, even after retraining, still exceeded the post-retrain-threshold. This was single-layer pruning, meaning filters were removed from one layer at a time.
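In outline, the loop looks roughly like this. It is a sketch, not the project's actual code: retrain and validation_loss are hypothetical helpers, and prune_one_layer is sketched in the next section:

```python
def prune_model(model, full_loss, prune_fraction,
                post_prune_thresh=1.1, post_retrain_thresh=1.1):
    """Prune one layer per iteration until even retraining can no longer
    recover the validation loss. Thresholds are ratios relative to the
    full (unpruned) model's validation loss."""
    while True:
        candidate = prune_one_layer(model, prune_fraction)
        loss = validation_loss(candidate)            # hypothetical helper
        if loss / full_loss > post_prune_thresh:     # pruning hurt too much
            candidate = retrain(candidate)           # hypothetical helper
            loss = validation_loss(candidate)
            if loss / full_loss > post_retrain_thresh:
                return model                         # stop: keep last good model
        model = candidate
```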

Information about every pruning iteration, such as the layer selected for pruning, the model's parameter count, validation loss, recall, precision, and the loss ratios against the post-prune-threshold and post-retrain-threshold, was recorded in a log file.

Determining Layer to Prune

Deciding which layer to prune was done by calculating the model's validation loss after one of its layers had been modified. The modification was the removal of the layer's filters based on the importance of each individual filter, which was determined by calculating its sum of absolute weights (SAW). The SAW scores were ordered from lowest to highest, and the lowest-scoring set was pruned. There were 22 prunable layers in YOLOv2 (excluding the last layer), so the algorithm cycled through each layer, pruned its weakest filters, and recorded the resulting validation loss. In the end, there were 22 losses, and the layer whose modification produced the lowest loss was picked for permanent pruning.
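A sketch of both pieces, assuming Keras-style kernels of shape (kh, kw, in_channels, n_filters); get_conv_kernel, remove_filters, and validation_loss are hypothetical helpers standing in for the real implementation:

```python
import numpy as np

def saw_scores(kernel):
    """Sum of absolute weights (SAW) per filter.
    kernel shape: (kh, kw, in_channels, n_filters)."""
    return np.abs(kernel).sum(axis=(0, 1, 2))

def prune_one_layer(model, prune_fraction):
    """Tentatively prune each of the 22 candidate layers and keep the
    single modification that yields the lowest validation loss."""
    best_model, best_loss = None, float("inf")
    for idx in range(22):
        kernel = get_conv_kernel(model, idx)            # hypothetical helper
        n_prune = int(prune_fraction * kernel.shape[-1])
        weakest = np.argsort(saw_scores(kernel))[:n_prune]
        trial = remove_filters(model, idx, weakest)     # hypothetical helper
        loss = validation_loss(trial)                   # hypothetical helper
        if loss < best_loss:
            best_model, best_loss = trial, loss
    return best_model
```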

Filter Pruning Operation

In the YOLOv2 model, a typical layer starts with a 2D convolution, followed by batch normalization, and ends with a LeakyReLU activation. A 2D convolution has three main characteristics: the kernel size, the kernel depth, and the number of filters. Filter pruning modifies the number of filters in the 2D convolution, but it also changes the number of parameters in the batch normalization that follows. In addition, changing the number of filters in a layer changes the kernel depth of the next layer's 2D convolution. For example, if the number of filters in the n-th layer is halved from 1024 to 512, then the kernel depth of the (n+1)-th layer must be halved as well. The batch-normalization parameters of the (n+1)-th layer are unchanged, and max-pooling operations are not affected by filter pruning.
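A minimal sketch of this bookkeeping with NumPy slicing, where keep is the set of surviving filter indices in layer n:

```python
import numpy as np

def prune_filters(kernel_n, bn_params_n, kernel_next, keep):
    """Remove pruned filters from layer n and the matching kernel-depth
    slices from layer n+1.

    kernel_n:    (kh, kw, in_ch, n_filters) conv kernel of layer n
    bn_params_n: list of per-filter batch-norm vectors (gamma, beta,
                 moving mean, moving variance), each of length n_filters
    kernel_next: (kh, kw, n_filters, next_filters) conv kernel of layer n+1
    keep:        indices of filters in layer n that survive pruning
    """
    kernel_n = kernel_n[:, :, :, keep]            # drop output filters of layer n
    bn_params_n = [p[keep] for p in bn_params_n]  # batch norm shrinks with layer n
    kernel_next = kernel_next[:, :, keep, :]      # drop input channels of layer n+1
    return kernel_n, bn_params_n, kernel_next
```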

The pruning algorithm can prune filters within a layer at three step sizes: 75%, 50%, and 25%. For example, if the step size is 75%, a layer with 1000 filters has its 750 lowest-scoring filters removed, leaving only 250. The algorithm also allows all three step sizes to be used one after another: it starts pruning at 75% iteratively until a pruning step pushes the loss above the post-prune-threshold and retraining can no longer bring it back below the post-retrain-threshold, at which point the step size drops to 50%. The same process continues at 50% and ends at 25%. Once pruning stops (at the 25% step size), the algorithm takes the model from the previous iteration (the one before the final, failed pruning step), retrains it, and returns it as the output pruned model.
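The three step sizes chain together in a simple outer loop, reusing the hypothetical prune_model sketch from above:

```python
def prune_with_schedule(model, full_loss):
    """Chain the three step sizes: coarse pruning first, fine pruning last."""
    for step in (0.75, 0.50, 0.25):
        model = prune_model(model, full_loss, prune_fraction=step)
    return retrain(model)   # final retraining pass on the last good model
```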

Pruning-Algorithm Analysis

Three methods of calculating the importance factor for each individual filter were compared: SAW, L2 norm, and random selection. The L2 method simply takes the L2 norm of each filter's weights within the layer. The results are shown in the table below.
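For comparison with the SAW scoring sketched earlier, the other two importance scores can be written as follows (random selection amounts to assigning random scores):

```python
import numpy as np

def l2_scores(kernel):
    """L2 norm of each filter's weights.
    kernel shape: (kh, kw, in_channels, n_filters)."""
    return np.sqrt((kernel ** 2).sum(axis=(0, 1, 2)))

def random_scores(kernel):
    """Random scores: equivalent to picking filters to prune at random."""
    return np.random.rand(kernel.shape[-1])
```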

Based on the results, the random method was not a viable strategy for picking which filters to prune because it damaged the model too quickly, without any ability to recover the loss. It failed to compress the model, achieving only one layer of pruning before the algorithm stopped. L2 and SAW were good candidates because they were able to reduce the model size by over 90% without too much drop in performance.

Below is a pie chart showing the proportion of pruning step sizes used throughout the entire pruning process for the SAW and L2 methods.

The majority of model compression happened at the 50% step size. The 75% step size let the algorithm compress the model faster but removed filters too aggressively, while the 25% step size served to fine-tune the compression further.

The L2 and SAW methods were chosen for further study, with both the post-prune-threshold and post-retrain-threshold set at 1.1 (meaning the validation loss may not exceed the full model's by more than 10%). The results showed that SAW and L2 both maintained the compressed model's performance at a similar level. However, on average SAW compressed the model better than L2 and produced slightly better-performing compressed models, as shown below.

The overall behavior of the pruning algorithm for both the L2 and SAW methods is summarized in plots 1 and 2 below. Both methods used the same post-prune and post-retrain threshold of 1.1. Each plot has two y-axes: the left indicates the number of parameters in the model and the right indicates validation loss. The algorithm removed parameters in a decaying fashion, with the number of parameters remaining in the model converging to a plateau. The "number of parameters" curve is drawn as a dashed line against the left y-axis and is broken into three parts indicating the pruning step size: 75% (blue), then 50% (red), then 25% (green). The validation loss after pruning is drawn as a solid light-blue line against the right y-axis. The red dots indicate where retraining occurred and what the recovered validation loss was.

Plot 1
Plot 2

Below is a figure visualizing the layers picked for pruning for SAW_1.1 and L2_1.1, where blue marks represent the 75% step size, red marks the 50% step size, and green marks the 25% step size. There are 22 columns representing the 22 layers from left (first layer) to right (last layer), and each row represents one pruning iteration.

Lastly, the pruned YOLOv2 model was used to draw bounding boxes on the images of the test set. Below are some example results.

The same operation was performed on the Tiny-YOLO as well.

Running the Model on a Raspberry Pi

A Raspberry Pi 3 Model B+ with a 16 GB microSD card was used, running Raspbian Stretch 32-bit with Desktop (kernel version 4.14). Intel's Movidius Neural Compute Stick (NCS) was used as an inference accelerator to assist the Raspberry Pi in running the YOLO-based object detection program. Intel's OpenVINO toolkit was used to convert the models into a representation that the detection program, running on the Pi's ARM CPU, can offload to the NCS.

Inference for all models was benchmarked in three settings: CPU only, NCS + CPU, and NCS + Raspberry Pi.
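As a rough sketch of the deployment side (not the exact code used in this project), loading an OpenVINO IR model onto the NCS with the Python inference-engine API of that era looks something like the following; the model file names and input shape are hypothetical:

```python
import numpy as np
from openvino.inference_engine import IECore

ie = IECore()
# hypothetical file names for the converted (IR) pruned model
net = ie.read_network(model="yolov2_pruned.xml", weights="yolov2_pruned.bin")
# "MYRIAD" targets the Movidius NCS; "CPU" would run on the host CPU instead
exec_net = ie.load_network(network=net, device_name="MYRIAD")

input_name = next(iter(net.input_info))
frame = np.zeros((1, 3, 416, 416), dtype=np.float32)  # placeholder NCHW input
result = exec_net.infer(inputs={input_name: frame})   # raw YOLO output tensors
```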

The two main metrics used for comparison were inference time in milliseconds and the memory usage of the system running the program. The table below summarizes every model's inference time on each platform.
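Per-frame inference time can be measured with a simple wall-clock loop; this is a generic sketch (infer_fn stands in for whichever backend is being benchmarked), with warm-up runs excluded from the average:

```python
import time

def mean_inference_ms(infer_fn, frame, n_runs=50, n_warmup=5):
    """Average per-frame inference time in milliseconds."""
    for _ in range(n_warmup):        # warm-up runs are not timed
        infer_fn(frame)
    start = time.perf_counter()
    for _ in range(n_runs):
        infer_fn(frame)
    return (time.perf_counter() - start) * 1000.0 / n_runs
```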

The performance comparison between the full models and the compressed models is summarized in the table below. For YOLOv2, the model was compressed by 95% while recall and precision dropped only slightly.

The memory usage of all models running on NCS + Raspberry Pi is summarized in the table below. As expected, the compressed models take up less memory.

Conclusion

It is very interesting to see that the model can be compressed this much while still holding on to the same level of performance. The advantage is that when you have a specific object detection task but not a lot of data, you can exploit transfer learning to gain a performance boost not possible with limited data alone. However, much of such a model does not contribute to extracting features useful for the task, which is why over 90% of it could be removed here. The pruning algorithm shared here preserves and reorganizes the important weights into a more compact model, which is essential for edge inference.

