Improving Gas Cylinder Digit Detection with a Synthetic Dataset
A gas cylinder repair factory wanted to automate its screening process, which sorts incoming cylinders into two groups: those that enter the factory and those that do not. The criterion is the last manufacturing date: if the cylinder was manufactured at least 5 years ago, it can enter the repair factory; if not, it goes back into circulation. So an automated method was needed to read these manufacturing dates.
On paper this was a basic OCR task; however, things were not that simple. The main problem was that the digits were painted onto the cylinders, where they wore down over time. I tried using a Google AI service to read these digits, but it didn’t work.
In this post, I will share an example of how a generated synthetic dataset can help improve model performance.
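The screening rule itself is simple once the date has been read. As a minimal sketch, assuming the detected digits have already been parsed into a full calendar date (the function name and the day-level precision are illustrative assumptions, not part of the original system):

```python
from datetime import date

def should_enter_repair(manufactured: date, today: date, min_age_years: int = 5) -> bool:
    """Return True if the cylinder is at least `min_age_years` old on `today`."""
    # Count full years elapsed; subtract one if this year's anniversary
    # of the manufacturing date has not been reached yet.
    before_anniversary = (today.month, today.day) < (manufactured.month, manufactured.day)
    age_years = today.year - manufactured.year - int(before_anniversary)
    return age_years >= min_age_years

# Hypothetical dates, for illustration only.
old_enough = should_enter_repair(date(2015, 6, 1), date(2020, 6, 1))
too_new = should_enter_repair(date(2015, 6, 2), date(2020, 6, 1))
```

If the painted date only carries a month and year, the same comparison can be made at month granularity.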
Data Collection
A jig was built on which the cylinder was laid on its side. A high-definition camera was installed above the cylinder to capture its body, specifically the area where the manufacturing dates were painted.
The cylinders were slowly rotated while the camera captured the images.
We captured about 967 images of readable digits. To enhance the digits’ readability, we performed histogram equalization to brighten the images and boost their contrast.
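For readers unfamiliar with histogram equalization, here is a minimal NumPy sketch of the idea for an 8-bit grayscale image. It is not the code used in the project (OpenCV's `cv2.equalizeHist` would be the usual choice); the function name and the random test image are assumptions for illustration.

```python
import numpy as np

def equalize_histogram(gray: np.ndarray) -> np.ndarray:
    """Histogram-equalize an 8-bit grayscale image (H x W, dtype uint8)."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]  # CDF value at the darkest intensity present
    total = gray.size
    if total == cdf_min:
        # Flat image: every pixel has the same value, nothing to stretch.
        return gray.copy()
    # Build a lookup table that spreads the used intensities over 0..255.
    lut = np.round((cdf - cdf_min) / (total - cdf_min) * 255)
    lut = np.clip(lut, 0, 255).astype(np.uint8)
    return lut[gray]

# A dim, low-contrast stand-in image (values 40..89 only).
rng = np.random.default_rng(0)
dim = rng.integers(40, 90, size=(64, 64), dtype=np.uint8)
out = equalize_histogram(dim)
```

After equalization, the darkest pixel present maps to 0 and the brightest to 255, which is what makes faint paint easier to see against the metal.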
Once the data was cleaned and preprocessed, we annotated the digits with bounding boxes and framed this problem as an object detection task.
Dataset Analysis
We realized that we had a serious data imbalance problem: the digits were not equally distributed. The majority of the digits were 0, 1, 2, 3, 5, and 6, while there were barely any instances of 4, 7, 8, and 9. Below is a frequency plot of the digits, with the digits distributed evenly across the training, validation, and test splits.
An imbalance like this typically results in a poorly performing model. Regardless, I went ahead and trained one. The model I used at the time was YOLOv3.
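A frequency count like the one behind the plot can be sketched in a few lines. The annotation format below (image name mapped to a list of digit labels) and the toy data are assumptions for illustration, not the project's real annotations.

```python
from collections import Counter

def digit_frequencies(annotations):
    """Count how often each digit class 0-9 appears across annotated images."""
    counts = Counter({d: 0 for d in range(10)})  # keep zero-count digits visible
    for labels in annotations.values():
        counts.update(labels)
    return counts

def rare_digits(counts, ratio=0.3):
    """Digits whose count is below `ratio` times the most frequent digit's count."""
    top = max(counts.values())
    return sorted(d for d, n in counts.items() if n < ratio * top)

# Toy annotations: image name -> digit labels of its bounding boxes.
annotations = {
    "img_001": [0, 3, 1, 6],
    "img_002": [1, 2, 1, 5],
    "img_003": [0, 6, 2, 0],
    "img_004": [0, 5, 1, 3],
    "img_005": [4, 1, 2, 6],
}
counts = digit_frequencies(annotations)
underrepresented = rare_digits(counts)
```

Running the same count separately on each split is a quick way to check that the training, validation, and test sets share a similar digit distribution.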
Model Training
Below is the result table describing the object detection performance for each digit. As expected, the performance was not that great for digits without much training data. Surprisingly, the model was not able to detect digit 1 even though there were a lot of examples in the training set.
Generate Synthetic Dataset
First, I collected digit fonts from Microsoft Word, which totaled 1,630 font samples evenly distributed among all digits.
Then I collected background images of cylinders as well as metal surfaces via internet image scraping.
After that, I wrote a script that extracted only the digit glyphs and overlaid them onto the metal surface backgrounds. I also applied image processing techniques such as thresholding to create masks for the digits, making them look worn out.
In addition, I applied different whiteness levels and font sizes to the digits. The fonts were selected at random, and the locations at which they were overlaid on the background were also assigned at random. Below are some of the generated synthetic images.
A total of 4,275 images were generated for training and 1,519 for validation, considerably more than the 969 images in the real dataset.
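The generation steps above can be sketched roughly as follows. This is not the original script: the glyph is a hand-made boolean mask standing in for a rendered font, the background is random noise standing in for a metal surface, and the function name, parameter ranges, and YOLO-style label format are all assumptions.

```python
import numpy as np

def make_synthetic_sample(background, glyph, digit, rng, wear=0.3):
    """Paste one worn-looking digit onto a grayscale background.

    background: H x W uint8 image (e.g. a metal surface crop).
    glyph: h x w boolean mask of the digit shape (True = paint).
    Returns the new image and a label
    (class, x_center, y_center, width, height) normalized to 0..1.
    """
    image = background.copy()
    # Random "font size": integer upscaling of the glyph mask.
    scale = int(rng.integers(1, 4))
    mask = np.repeat(np.repeat(glyph, scale, axis=0), scale, axis=1)
    h, w = mask.shape
    H, W = image.shape
    # Random top-left corner such that the glyph fits inside the background.
    top = int(rng.integers(0, H - h + 1))
    left = int(rng.integers(0, W - w + 1))
    # Wear: threshold uniform noise so roughly `wear` of the paint is missing.
    worn = mask & (rng.random(mask.shape) >= wear)
    # Random whiteness level for the paint.
    whiteness = int(rng.integers(150, 256))
    region = image[top:top + h, left:left + w]  # view into `image`
    region[worn] = whiteness
    label = (digit, (left + w / 2) / W, (top + h / 2) / H, w / W, h / H)
    return image, label

# A crude 7x5 mask for the digit "1" and a noisy stand-in background.
glyph_one = np.zeros((7, 5), dtype=bool)
glyph_one[:, 2] = True   # vertical stroke
glyph_one[6, 1:4] = True # base
rng = np.random.default_rng(42)
background = rng.integers(60, 120, size=(64, 64), dtype=np.uint8)
image, label = make_synthetic_sample(background, glyph_one, 1, rng)
```

Because the label is produced together with the image, every synthetic sample comes annotated for free, which is the main appeal of this approach for object detection.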
Experiments
First, I trained a model on only the synthetic dataset and tested it on the real-data test set. For the second model, I trained on the synthetic dataset and then fine-tuned on the real-data training set. Then I compared the two models.
Below is the result of the model trained on the synthetic dataset only.
Below is the result of the model trained on the synthetic dataset and fine-tuned on the real dataset.
For easy comparison, I compared the mAP of the 3 models (real data only, synthetic data only, and synthetic data fine-tuned on real data) and plotted histograms of TP, FP, and FN by digit for each model.
Afterward, I did a simple hyperparameter search for the confidence threshold that yielded the best F1-score, which pushed the best-performing model’s mAP from 44.06% to 47.32%. This is an example of how a synthetic dataset can actually help improve model performance when the real dataset is small and imbalanced.
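For reference, TP, FP, and FN counts per digit are usually obtained by matching predictions to ground-truth boxes by IoU. The sketch below shows one common way to do it for a single image; the IoU threshold of 0.5, the greedy matching, and the box format are assumptions, since the post does not state the exact evaluation settings.

```python
from collections import defaultdict

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def count_tp_fp_fn(ground_truth, predictions, iou_thr=0.5):
    """Per-digit TP/FP/FN for one image.

    ground_truth: list of (digit, box)
    predictions:  list of (digit, confidence, box)
    """
    stats = defaultdict(lambda: {"TP": 0, "FP": 0, "FN": 0})
    matched = set()
    # Greedy matching: the most confident predictions claim boxes first.
    for digit, _conf, box in sorted(predictions, key=lambda p: -p[1]):
        best, best_iou = None, iou_thr
        for i, (gt_digit, gt_box) in enumerate(ground_truth):
            if i in matched or gt_digit != digit:
                continue
            overlap = iou(box, gt_box)
            if overlap >= best_iou:
                best, best_iou = i, overlap
        if best is None:
            stats[digit]["FP"] += 1  # no unmatched box of this digit overlaps enough
        else:
            matched.add(best)
            stats[digit]["TP"] += 1
    # Ground-truth boxes nobody claimed are misses.
    for i, (gt_digit, _box) in enumerate(ground_truth):
        if i not in matched:
            stats[gt_digit]["FN"] += 1
    return dict(stats)

# Toy example: a correct "1", and a "9" misread as "7".
ground_truth = [(1, (0, 0, 10, 20)), (9, (12, 0, 22, 20))]
predictions = [(1, 0.9, (1, 0, 11, 20)), (7, 0.6, (12, 0, 22, 20))]
stats = count_tp_fp_fn(ground_truth, predictions)
```

In the toy example, the misread digit shows up twice: as an FP for 7 and as an FN for 9.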
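The threshold search can be as simple as a sweep over candidate values. This is a minimal sketch, assuming each detection has already been marked as a true or false positive (for example by greedy IoU matching in descending confidence order, so dropping low-confidence detections does not change the earlier matches). The function name and toy data are hypothetical.

```python
def best_confidence_threshold(detections, num_ground_truth, thresholds):
    """Return the (threshold, f1) pair with the highest F1-score.

    detections: list of (confidence, is_true_positive) pairs
    num_ground_truth: total number of annotated digits
    """
    best_thr, best_f1 = None, -1.0
    for thr in thresholds:
        kept = [is_tp for conf, is_tp in detections if conf >= thr]
        tp = sum(kept)
        fp = len(kept) - tp
        fn = num_ground_truth - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:  # ties keep the lower threshold
            best_thr, best_f1 = thr, f1
    return best_thr, best_f1

# Toy detections: (confidence, matched a ground-truth digit?)
detections = [
    (0.95, True), (0.90, True), (0.70, False),
    (0.60, True), (0.40, False), (0.30, False),
]
thresholds = [i / 10 for i in range(1, 10)]
best_thr, best_f1 = best_confidence_threshold(detections, 4, thresholds)
print((best_thr, best_f1))  # (0.5, 0.75)
```

Ideally the sweep is run on the validation set, so that the test set stays untouched for the final comparison.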
I want to thank Danuwasin Sittiworachat for preparing and analyzing the dataset. I also want to thank the project manager Sutthipan Techasena for coordinating the collection of this dataset.