Automated custom Scaled-YOLOv4 object detection model training for total beginners (no coding or AI skills needed) — free GPU

Christoph Haring
19 min read · Aug 18, 2021


The following tutorial and the corresponding code written in Jupyter notebooks (1, 2, and 3 — full repo: link) will show you how to automatically train a custom Scaled-YOLOv4 object detection model (the best state-of-the-art real-time object detection model as of April 2021 [1]).
Paper, Benchmark (MS COCO dataset), Model.

Since I have no formal training in coding, some of the provided code may not be very elegant. If the code does not work for your problem, feel free to comment and I will do my best to make it work. I merely hope to help some people and researchers without coding or AI experience to train/build the currently best real-time object detection model on their own data.

Fig 1: Example object detection results of a YOLOv4 model on standardized goods labels with label and optical character recognition (OCR) of alphanumeric characters

The structure of the tutorial is as follows:

OUTLINE

1. INTRODUCTION
2. DETAILED WALKTHROUGH FOR CUSTOM DATASET
3. BASICS

As this tutorial should also be useful for complete beginners in the field of Object Detection, Computer Vision (CV), Deep Learning (DL), Neural Networks (NN), Machine Learning (ML), and Artificial Intelligence (AI), I added a brief overview of the most important terms and basics about AI/ML/DL in the Basics section at the end of the article. The Basics section is not essential for the tutorial to work, but it will help complete AI laymen to roughly understand the topic and the approximate functionality of the model used. If you have never heard of AI or object detection, I highly recommend reading the Basics section first.
I will not describe the Scaled-YOLOv4 model in detail (YOLO = You Only Look Once). This was already well done here (+ paper). If you want to understand the model in detail, you should also take a closer look at summaries and papers about the predecessor models YOLOv1-v4 [Here (v1-v3), here (v3,v4), here and here(v4)] or the paper (v1, v2, v3, v4).

For this tutorial, I prefer to dive directly into practice and start with a short introduction to the tools we are going to use, including a completely automated sample run-through (no input required) as a first example.

1. INTRODUCTION:

1.1 Prerequisites and Tools

The only requirements needed are:

  • Google account with Google Drive (= free Google cloud)
  • Google Colab runtime (a free cloud-based Linux virtual machine that runs Jupyter notebooks). This is like a Linux computer running in the cloud, with substantial computational power including a GPU (Graphics Processing Unit) and the most important packages for AI development preinstalled. The provided Jupyter notebooks (1, 2, and 3) each run on such a Colab runtime.

Imagine that Colab is your new computer for AI stuff and Google Drive is your external hard drive with all your data on it (you have to use the external hard drive because your computer [the Colab runtime] deletes all your individual data every time you restart it [or after 12 h]). Altogether it should not take more than 1–2 hours until you can start training your own custom neural-network-based object detection model without prior experience in coding.
You only have to copy/paste a few variables and press play (Ctrl + F9 runs all cells of the Colab notebook).

Object detection is an ML application. Most typical ML/DL projects can be represented by an end-to-end (E2E) ML pipeline. The figure below illustrates the generic main steps of such a pipeline, and this project follows the same organization.

Fig 2: Typical ML pipeline (source)

The provided tutorial/code automates steps 1–4 for Scaled-YOLOv4 and produces a model that can then be applied/deployed (steps 5–6) afterward. Therefore, the tutorial is separated into three steps, each with a corresponding Colab notebook:

1. Download, annotate, and rearrange labelled image data OR label your own custom dataset, then save everything in one folder (in basic YOLO format)

2. Rearrange the basic YOLO dataset files into the specific folder structure needed for automated Scaled-YOLOv4 use

3. Train your custom Scaled-YOLOv4 model, plot and save the results, and apply your model (or other trained Scaled-YOLOv4 models) to test data (= inference)

An overview of the full workflow of this project as a pipeline is shown in the illustration below (high-res link). The graphic shows all notebook names (NOTEBOOK), the major pipeline steps (PIPELINE), the mandatory variables for you to insert into each notebook plus the Google Drive mount (highlighted in yellow), as well as the detailed dataset and/or model folder structure of each notebook's output. This is helpful if you want to know which notebook fulfills which specific function.

Fig 3: Full workflow steps of this project (automated custom Scaled-YOLOv4 training + test)

Enough talk, let's start doing something with the fully automated run-through.

1.2 First attempt using only the generic variables:

For this attempt, you don't even have to copy/paste variables. Just click the links for the notebooks here: (1, 2, and 3). I have set up everything for an exemplary dataset of road traffic objects (as they are relevant for autonomous driving). The whole pipeline runs automatically. All you have to do is press play in the notebook (Ctrl+F9) and connect your Google Drive to Colab (this starts automatically and is shown in the next image below).

First of all, open the first notebook (00_…) and press Ctrl+F9. Scroll down a bit until you get to the section (= cell in the notebook) where you have to mount Google Drive, which looks and works like this:

Fig 4: Steps of Google Drive mount in Colab

After the successful Google Drive mount, you will see the folder gdrive in your root folder (in some cases it might be necessary to refresh the root folder by clicking on reload root folder).

Fig 5: Google Drive folder (gdrive) in Colab root folder (=content) after successful mount

The root directory you can see is the /content directory of your Colab runtime session. It is the default directory when you open a notebook in Colab, and it only contains the sample_data folder. Mounting Google Drive is the best way to get large amounts of data (as needed here) from your local computer into the Colab runtime, and from Colab back to your local machine.
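For reference, the mount cell in the notebooks boils down to the standard Colab drive API. A minimal sketch (the exact mount point name used in the notebooks may differ slightly):

```python
# Mount Google Drive inside a Colab runtime. After running this cell, Colab
# asks you to authorize access; your Drive then appears under /content/gdrive
# in the file browser on the left.
from google.colab import drive

drive.mount('/content/gdrive')
```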

HINT: If one of the notebooks does not run through to the last cell (after the Drive mount), first try to restart the Colab runtime (image below) and run it again (Ctrl + F9). Many commands and directory names that enable the full automation depend on the execution order of the cells.

Fig 6: Factory reset of Colab runtime

After notebook 1 is finished, go grab some coffee (or wait up to ~5 minutes), as it takes a few minutes to upload the data to your Google Drive (cloud). The same holds true for notebook 2.
As the last step, you simply open notebook 3 and press Ctrl+F9 again. After the Drive mount, the Scaled-YOLOv4 model will automatically start training on the pre-set dataset. After the training is completed, the trained model will automatically be tested and applied to the test data. The results are also saved to your dataset folder. Your results should look similar to the example in the illustration below (Fig 7 and 8 show results for anomaly detection on polymer films).

Fig 7: Metrics of the finished model for anomaly detection
Fig 8: Exemplary commands to apply the model and save the results (mid — red and blue) to Google Drive folder (left blue) and an exemplary detection result file (left & right in yellow)

Now that you have seen how everything works, let’s start by training a model with a custom dataset.

2. DETAILED WALKTHROUGH FOR CUSTOM DATASET

As in the ML pipeline shown above, we start with the data preparation for an object detection model in notebook 1.

2.1 Notebook 1 -> DATA PREPARATION for object detection (dataset generation with YOLO annotation/labels)

BASICS OBJECT DETECTION/YOLO DATASET
Image data for the training of an object detection model first requires labeling. Labeling means defining the locations of the desired objects with rectangular bounding boxes and naming the corresponding class c of each object. An exemplary folder structure of a dataset is shown in (b) in the image below.
It usually consists of images and corresponding text files (.txt) with the same names. The .txt files note which objects (classes c) are located at which coordinates in the corresponding image [see (b) below or the data folder structure of column 1 in Fig 3].
In addition, the dataset contains a list of all classes (for YOLO this is classes.txt, with one class per line/row, see (a)). I will demonstrate how to build your completely custom dataset in the following section.

Fig 9: (a) Example elements/ classes in classes.txt (here: two types of goods labels and digits/characters) and (b) base object detection dataset folder structure

Different object detection models represent coordinates and classes differently in the text files. In the YOLO format, each object is written as one line with five entries, which are (from left to right):

1. class index c (the line number of the class in classes.txt, counted from 0)
2. x coordinate of the midpoint of the object (from the top left of the image)
3. y coordinate of the midpoint of the object (from the top left of the image)
4. object width
5. object height

The coordinates are measured from the top-left corner of the image and are normalized (between 0 and 1) by dividing by the image width and height, as illustrated below.

Fig 10a: Label format of a YOLO annotation .txt file (link)

A specific example for a Full HD image (1920×1080 pixels) containing two objects, shown as white and yellow boxes (goods labels for international supply chain purposes), is illustrated in Fig 10b.

Fig 10b: Specific example of a labeled YOLO image with the corresponding label .txt file
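If you ever need to create such a label line yourself (for example when converting annotations from another tool), the conversion from pixel coordinates to the normalized YOLO format is straightforward. A minimal sketch; the box coordinates are made-up example values, not the exact boxes from Fig 10b:

```python
# Convert a pixel-space bounding box into a normalized YOLO label line.
# The example box coordinates below are hypothetical illustration values.
def to_yolo_line(class_index, x_min, y_min, x_max, y_max, img_w, img_h):
    x_center = (x_min + x_max) / 2 / img_w   # normalized x of the box midpoint
    y_center = (y_min + y_max) / 2 / img_h   # normalized y of the box midpoint
    width = (x_max - x_min) / img_w          # normalized box width
    height = (y_max - y_min) / img_h         # normalized box height
    return f"{class_index} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"

# Example: class 0 on a 1920x1080 image, box from (600, 300) to (1000, 700)
print(to_yolo_line(0, 600, 300, 1000, 700, 1920, 1080))
# -> 0 0.416667 0.462963 0.208333 0.370370
```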

AUTOMATIC DATASET (OPEN IMAGES) DOWNLOAD

Notebook 1 automatically downloads a labeled dataset from the Open Images Dataset V6 website and saves everything into one folder, ready for transfer to notebook 2 (which converts this standard YOLO-labeled object detection dataset from all files in one folder into the Scaled-YOLOv4 folder structure).

To do so, follow this link to the Open Images webpage and select the objects you want to detect in the category menu. The settings are:
Subset: Training
Type: Detection
as highlighted by the white boxes in Fig 11.

Fig 11: Example of object detection dataset of Jellyfish

Write the class names exactly as they appear in the dropdown menu of the Open Images webpage into the copied_class_names parameter at the top of the first notebook (see Jellyfish in the white box in Fig 11, or the first red box in Fig 12: copied_class_names = “Vehicle registration plate, Traffic sign, Car, Human body”).
Afterwards, choose the number of images you want to download using the slider (Fig 12: number_of_download_images_per_classes = 100) and enter the root folder name, i.e., the folder the dataset should be downloaded into (Fig 12: root_folder_YOLO_dataset = “/YOLO”).

Fig 12: Mandatory variables in notebook 1
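The notebook handles the download for you (it builds on the AI Guy's downloader functions), but if you prefer to fetch an Open Images subset yourself, a free library such as FiftyOne can do a similar job. A minimal sketch, assuming the fiftyone package and that its Open Images zoo dataset and YOLO export type are available in your version (class name, sample count, and export path are example values):

```python
# Hedged alternative to the notebook's built-in downloader: load a small
# Open Images detection subset with FiftyOne and export it in YOLO format.
import fiftyone as fo
import fiftyone.zoo as foz

dataset = foz.load_zoo_dataset(
    "open-images-v6",
    split="train",
    label_types=["detections"],
    classes=["Jellyfish"],   # must match the Open Images class names exactly
    max_samples=100,
)

# Export images plus YOLO-style .txt labels into one folder; the dataset type
# and label field names may differ slightly between FiftyOne versions.
dataset.export(
    export_dir="/content/gdrive/MyDrive/YOLO",
    dataset_type=fo.types.YOLOv4Dataset,
    label_field="detections",
)
```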

Watch out!
If you are using the free Google Colab version, you shouldn't use too many classes (not more than 7–10). Training too many classes/images will take a long time, and Google Colab may abort when too many computational resources are used. The free version is limited to 12 h of working time. If the runtime or computational resources are exceeded, your Colab VM will be restarted and all data will be lost.
To train larger datasets with more classes, there are several options. For example, you can subscribe to Colab Pro (but even then you only have 24 hours and not "infinite" computing power).
There is also the possibility to open your .ipynb notebook on your local computer via JupyterLab, or you can connect the Colab notebook to a cloud service like Google Cloud. I will demonstrate how to configure a Google Cloud account in only a few minutes (with $300 of GPU computing power for free) and connect it to your notebook in another short article. If you need to train many classes with the free Colab version, you can use the tutorial iteratively in multiple steps (I will integrate this functionality for automated training with a custom pre-trained weights file as soon as possible).

COMPLETE CUSTOM DATASET LABELLING:
If you want to use your own specific images, you need to label them on your local computer or in your browser (since Colab does not support labeling software).
Makesense is the easiest tool as it runs online in your browser (tutorial).
There are also many good image-labeling tools for CV/object detection in YOLO format for your local computer, e.g., Yolo_mark, BBox-Label-Tool, labelImg. Tutorials on how to use these tools are included in the links. My favorite for Windows and Mac is YOLO Label: on Windows, just download it here, unzip it, and start YoloLabel.exe (thanks to Yonghye).
The only thing you need to do is create a classes.txt text file containing the names of your classes (one per line!) and put all your unlabeled image data into one folder. Another helpful tutorial is from The AI Guy here.
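Before moving on to notebook 2, a quick sanity check saves a lot of debugging later: make sure classes.txt exists and every image has a matching label file. A minimal sketch (the folder path and class names are placeholders for your own dataset):

```python
# Create classes.txt (one class name per line) and check that every image in
# the dataset folder has a matching YOLO .txt label file.
from pathlib import Path

dataset_dir = Path("/content/gdrive/MyDrive/YOLO")   # placeholder path
class_names = ["goods_label", "digit"]                # placeholder classes

(dataset_dir / "classes.txt").write_text("\n".join(class_names) + "\n")

images = [p for p in dataset_dir.iterdir()
          if p.suffix.lower() in (".jpg", ".jpeg", ".png")]
missing = [p.name for p in images if not p.with_suffix(".txt").exists()]
print(f"{len(images)} images, {len(missing)} without a label file:", missing)
```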

2.2 Notebook 2 -> REARRANGE DATASET for SCALED-YOLOv4

After you have prepared your object detection dataset in YOLO annotation in one folder, it needs to be rearranged into the specific folder structure shown in column 2 of Fig 3. Your dataset is split into training, validation, and test data according to your chosen percentages of the total set (I locked the percentage ranges to typical/useful standards).
You just need to copy the path of your labeled dataset inside your Google Drive, as shown in Fig 14.

Copy and paste it from Google Drive into the mandatory parameter form of notebook 2.

Fig 13: Mandatory variables cell of Notebook 2

You can choose the percentages of how your dataset should be split using the sliders.

Fig 14: Non-mandatory variables cell of Notebook 2

After setting your variables, press Ctrl + F9 again and scroll down to the GDrive mount. Mount your Drive and wait until the notebook finishes rearranging the files for the Scaled-YOLOv4 model used in notebook 3. The notebook creates a .zip file (scaledYOLO_Dataset.zip by default, see the last entry of the non-mandatory variables) containing the rearranged folder structure with the dataset files.
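Notebook 2 does all of this automatically, but to make the idea concrete, here is a minimal sketch of what such a split boils down to. The folder layout, paths, and percentages are illustrative and not necessarily the exact structure the notebook produces:

```python
# Minimal sketch of a random train/val/test split of a one-folder YOLO dataset.
# Rounding may leave a handful of files unassigned; notebook 2 handles this
# properly, this is only for illustration.
import random
import shutil
from pathlib import Path

src = Path("/content/gdrive/MyDrive/YOLO")      # images + .txt labels in one folder
dst = Path("/content/scaledYOLO_dataset")
splits = {"train": 0.7, "val": 0.2, "test": 0.1}

images = sorted(p for p in src.iterdir()
                if p.suffix.lower() in (".jpg", ".jpeg", ".png"))
random.seed(0)
random.shuffle(images)

start = 0
for split_name, fraction in splits.items():
    count = round(len(images) * fraction)
    for img in images[start:start + count]:
        for part in ("images", "labels"):
            (dst / part / split_name).mkdir(parents=True, exist_ok=True)
        shutil.copy(img, dst / "images" / split_name / img.name)
        label = img.with_suffix(".txt")
        if label.exists():
            shutil.copy(label, dst / "labels" / split_name / label.name)
    start += count
```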

2.3 Notebook 3 -> AUTOMATIC TRAINING and INFERENCE of SCALED-YOLOv4

The last step of the tutorial is to train the neural network on your rearranged dataset and finally test/apply it in the inference step. All parameters are mandatory and should be self-explanatory by name. I will explain them briefly; if you have any further questions, please leave a comment. The parameters for the notebook are divided into three sections:

  1. Dataset variables (Location and .zip-file name)
  2. Training variables
  3. Inference variables

The red boxes highlight all selectable variables you can change. Unfortunately, in Google Colab it is currently only possible to apply the trained models to images and not to video data. However, I will try to enable the video data application as soon as possible.

1. DATASET variables

The dataset parameters follow the same rules as in notebook 2. This time, you only need to enter the location (base directory) of the rearranged dataset .zip file and the same .zip file name you chose in notebook 2 (default: scaledYOLO_Dataset.zip).

Fig 15: Mandatory variables Dataset (in Notebook 3)

2. TRAINING variables

The training variables are a little more extensive and complicated to choose. However, I have tried to give you only reasonable options for selection through dropdown menus, selection boxes, and predefined sliders as shown in Fig 16.

Fig 16: Mandatory variables Training (in Notebook 3)

First, you choose whether you want to train or just deploy an already trained model (first red box in Fig 16). If you skip training, you can immediately deploy another trained model on an image dataset. I will introduce the mandatory variables for automated deployment right after the training variables.

As the first training variable, you choose the model branch that will be downloaded from GitHub (tiny, CSP, or large). The branch determines which models are included (the tiny branch contains the tiny model, CSP contains the CSP model, and large contains all model sizes).
After the branch, you choose the model size/depth using the dropdown menu (tiny, CSP, p5, p6, or p7). The model size determines which image resolutions are used (models and corresponding resolutions here, or see Fig 17b). Figure 17a shows the different depths with the specific layers of the Scaled-YOLOv4 models (from smallest to largest: tiny < CSP < P5 < P6 < P7).

Fig 17a: Different model sizes/depths of Scaled-YOLOv4 [Source]; e.g., the definition of P7 as code: link
Fig 17b: Benchmark of average precision [AP] over latency [ms] of the best object detection models on the MS COCO dataset using an Nvidia Tesla V100 (higher AP + lower latency is better), April 2021 [Source], including the image resolutions of the models

As the next parameter, you define the number of epochs for training. An epoch is a complete training run-through of all training and validation images.
If you provide the path of a previous weights file (result file of a training), you can even continue a previous training session with the next two selection options.
With the following two checkboxes, you can select whether to use automatic image resolution and automatic batch size for the training. The batch size describes how many images are processed in one training step and is limited by the memory of the GPU. For large images over 1024×1024 pixels, the free Colab version can process a maximum of 2 images per training step. If you uncheck the boxes, you need to provide an image resolution as well as a batch size. With the last variable, you define which training run you want to save and deploy automatically (the default value is 0); if you have started several training runs in one runtime session, you can select a different run here. The logs (= results) of your training will automatically be shown in a dashboard (as already seen in Fig 7 and 8), and your trained model will automatically be deployed on your test data. The resulting images, including the detections, are also displayed in the notebook.
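Behind these form fields, notebook 3 essentially clones the ScaledYOLOv4 repository and calls its train.py script. The following is only a hedged sketch of what such a Colab cell looks like: the branch, config file location, data .yaml path, and even some flag names are assumptions that may differ between repository branches and versions, so check `!python train.py --help` in the cloned repo if anything does not match:

```python
# Hedged sketch of the training call notebook 3 issues behind the scenes
# (ScaledYOLOv4 repo, yolov4-large branch assumed; flags may differ).
!git clone -b yolov4-large https://github.com/WongKinYiu/ScaledYOLOv4.git
%cd ScaledYOLOv4

# Single-GPU training: keep the batch size small enough for the Colab GPU.
!python train.py --img 896 896 --batch-size 2 --epochs 100 \
    --data /content/scaledYOLO_dataset/data.yaml \
    --cfg models/yolov4-p5.yaml --weights '' --name my_custom_run
```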

3. INFERENCE variables

The last section of the variables is intended for the automated inference of other trained models, illustrated in Fig 18.

Fig 18: Mandatory variable for automated Inference of other trained models (in Notebook 3)

First, you define whether you want automated inference or not. Then you specify the location of your test images. If these files are located in multiple subfolders, you need to tick the checkbox; my code can also detect the subfolders by itself if you tick the auto_subfolder checkbox, otherwise you need to give the names of the subfolders separated by commas. Afterward, define the output directory for the automated saving of your results. Next, provide the complete path to the weights file of the training you want to use/test (it is located inside your GDrive). The last parameter specifies the confidence threshold of the detections: with a low threshold, more detections are made but with lower precision. An exemplary inference result is already shown on the right-hand side of Fig 8.
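Under the hood, the automated inference again boils down to a call to the repository's detect.py script. A hedged sketch with placeholder paths (flag names follow the YOLOv5-style detect.py used by the ScaledYOLOv4 repo and may differ between branches/versions):

```python
# Hedged sketch of the inference step on your own test images; all paths are
# placeholders, and flag names may differ between repo branches/versions.
!python detect.py \
    --weights /content/gdrive/MyDrive/my_custom_run/weights/best.pt \
    --source /content/gdrive/MyDrive/test_images \
    --img-size 896 \
    --conf-thres 0.4 \
    --output /content/gdrive/MyDrive/inference_results
```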

If you set up all variables correctly, your session should train and test automatically. Should any problems arise, comment and I will try to find a solution as quickly as possible.

As a reminder: the following Basics section is not essential for the tutorial to work, but it will help complete AI laymen to roughly understand the topic and the approximate functionality of the model used.

3. BASICS:

After reading the following, you should roughly understand what happens “behind the scenes” of this project (the code for the AI/ML/DL/neural network). If you are already familiar with these topics, just skip this section.
AI, ML, and DL are organized as follows:

Fig: Organization of buzzwords AI, ML and DL [source]

A short and understandable overview of these buzzwords as well as a delimitation is given here: Overview (AI, ML, and DL).

In a few words:
AI makes it possible for machines to learn from experience, adjust to new inputs, and perform human-like tasks with a minimum of human intervention. ML uses statistical techniques and maps/fits a given input dataset X to predefined mathematical functions f to make predictions (= ŷ), identify patterns, and make decisions, so that f(x) = ŷ. The output improves with experience (= more data).
DL sets up basic parameters (= weights) over the data and trains the computer to learn the values of these weights on its own by recognizing patterns through many layers of processing, called deep artificial neural networks (instead of fitting the data to predefined equations as in classical ML).

Scaled-YOLOv4 is a DL application for computer vision (CV), which processes image and video data. It is currently the best real-time model architecture for object detection. Object detection is an image processing task in the subfield of computer vision, characterized by the detection of multiple objects (defined as classes) as well as their position coordinates in the image as bounding boxes (the red, green, and blue boxes in the third image from the left below). Example coordinates are the x/y coordinates of the box center measured from the top-left corner, plus the box width and height.

Fig: Basic image processing tasks from https://web.eecs.umich.edu/~justincj/teaching/eecs498/FA2020/

The detection is also called the prediction, ŷ. How well the different objects and their positions are detected determines the accuracy (or prediction error) of the model. Object detection is useful in many areas such as autonomous driving, cancer cell detection, anomaly detection, etc. DL is accomplished with deep artificial neural networks (ANNs) (quick intro).

Neural networks are computing systems with interconnected nodes that work much like neurons in the human brain. Using algorithms, they can recognize hidden patterns and correlations in raw data, cluster and classify it, and — over time — continuously learn and improve.

ANNs are based on the Perceptron (right), which is a basic ‘learnable’ mathematical formulation of a biological neuron (left). It is also called a single-layer neural network. The perceptron evolved from the McCulloch-Pitts neuron, which was the first attempt at mathematically modeling a neuron. A graphical representation of these three terms is shown in the picture below.

Biological neuron (left), McCulloch-Pitts neuron (mid), and the resulting Perceptron as an artificial neuron (right) (Source + Coding Perceptron)

The basics of the MCP neuron, the Perceptron, and how the resulting mathematical function can “learn” (= the orange step in the image above) are well introduced in this article. Referring to the image, the output function of the perceptron is:

Output = Activation function(Bias + (Input matrix · Weight matrix))

with:
-- Input matrix as X1 to Xn
-- Weight matrix as W1 to Wn
-- Bias B to allow shifting the activation (the bias is also written as W0)
-- The activation function (a step function in the image) is used to introduce non-linearities into the network (generally a sigmoid for binary classification, resulting in an output of 0 or 1 for a single perceptron)
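To make this concrete, here is a minimal numerical sketch of a single perceptron forward pass; the inputs, weights, and bias are made-up values purely for illustration:

```python
# Single perceptron forward pass with a step activation (NumPy).
import numpy as np

def step(z):
    # Step activation: 1 if the weighted sum is positive, else 0
    return np.where(z > 0, 1, 0)

x = np.array([0.5, 1.0, -0.3])   # inputs X1..X3 (made-up values)
w = np.array([0.8, -0.4, 0.2])   # weights W1..W3 (made-up values)
b = 0.1                          # bias B (also written as W0)

output = step(b + np.dot(x, w))  # Output = activation(B + X . W)
print(output)                    # -> 1  (0.4 - 0.4 - 0.06 + 0.1 = 0.04 > 0)
```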

Many connected perceptrons form an artificial neural network, also called a multi-layer perceptron (MLP).

Multi-Layer Perceptron or fully connected artificial neural network (meaning all ‘neurons’ of adjacent layers are interconnected) with one hidden layer (Source)

MLPs with multiple hidden layers are called deep artificial neural networks. The number of hidden layers defines the depth of the network. If all neurons between two layers are interconnected, it is called a fully-connected neural network.

Most DL models (neural network architectures) for image processing tasks like object detection belong to the ML category of Supervised Learning.
In Supervised Learning, a model or function f is ‘learned’. It maps a given input x (= input data) to an output ŷ (= prediction) so that f(x) = ŷ (as in standard ML).

Learning of f is based on multiple input-output pairs X-Y (= a labeled dataset), and f contains many variables (e.g., the weight matrices W and bias matrices B for neural network models, as illustrated in the images above). The variables of f are “learned” from these input-output data pairs (X-Y). Learning is usually achieved by the backpropagation algorithm: backpropagation minimizes a loss function, e.g., the sum of squared differences between the labels Y (= ground truth) and the predictions/detections Ŷ over the N dataset pairs, by adjusting the variable values (e.g., W and B) in the direction of the negative gradient of the loss function (with respect to W and B).

Consequently, supervised learning means fitting a mathematical function with numerous variables to a labelled dataset (X-Y) to obtain the best possible predictions (and the lowest possible loss).
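To make this fitting step concrete, here is a tiny sketch of the loss described above and a plain gradient-descent loop for a linear model; deep networks do exactly the same thing, just with far more weights and with the gradients computed by backpropagation (the data and learning rate are made-up illustration values):

```python
# Supervised fitting in miniature: mean squared error loss and gradient
# descent for a linear model y_hat = X @ W + b (made-up data).
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0]])   # inputs (N = 3 samples)
Y = np.array([5.0, 4.5, 7.0])                        # labels (ground truth)
W = np.zeros(2)                                      # weights to be learned
b = 0.0                                              # bias
lr = 0.05                                            # learning rate

for _ in range(100):
    Y_hat = X @ W + b                  # predictions
    error = Y_hat - Y
    loss = np.mean(error ** 2)         # sum of squared differences over N
    grad_W = 2 * X.T @ error / len(Y)  # gradient of the loss w.r.t. W
    grad_b = 2 * error.mean()          # gradient of the loss w.r.t. b
    W -= lr * grad_W                   # step in the negative gradient direction
    b -= lr * grad_b

print(W, b, loss)
```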

In object detection, X represents the images and Y the corresponding labels as bounding boxes, which describe which object of a specific class c is located where in the image data. For example, in the object detection panel of the image-processing figure above, there are c = 3 classes (cat, dog, and duck) with corresponding bounding box labels y (red = cat, blue = dog, green = duck).

Learning the variable values is called “training”. It is the first of two major steps in creating a supervised learning model:

1. Training of the model (neural network based) = fitting variables of a mathematical model to labelled image/video data pairs (X — Y)

2. Inference of the trained model = application of the trained model to new (unknown) images

For training deep learning models, more data with as many variations as possible is (almost always) better. The more data, the lower the chance that the model “learns by heart” (= overfitting; the polynomial degree of the function is too high). However, if the model does not train long enough, it is underfitting (the polynomial degree is too low). Both have a negative impact on the detection performance of the model on new images. The picture below shows these relationships very clearly, with underfitting on the top left, a good model in the top middle, and overfitting on the top right. So stop training once the change in your loss function has not decreased significantly for a long time, but do train until this behavior occurs. The optimal complexity of a model is determined by the bias-variance trade-off (or here: 0, 1).

Model prediction error/loss versus model complexity for the training error/loss and the prediction error on new data, including graphs and markers for underfitting (top left) [high bias + low variance], good fitting (top middle) [low bias + low variance], and overfitting (top right) [low bias + high variance] (Source).
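The rule of thumb “stop training when the loss stops improving” is usually implemented as early stopping with a patience counter. A minimal sketch; the fake loss curve and the patience/min_delta values are only illustration placeholders:

```python
# Early stopping: stop when the validation loss has not improved by more than
# min_delta for `patience` consecutive epochs. The fake loss curve stands in
# for a real training loop.
fake_val_losses = [1.0 / (epoch + 1) + 0.05 for epoch in range(300)]

best_loss = float("inf")
patience, min_delta, epochs_without_improvement = 10, 1e-3, 0

for epoch, val_loss in enumerate(fake_val_losses):
    if best_loss - val_loss > min_delta:
        best_loss = val_loss             # real improvement: reset the counter
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            print(f"Stopping early at epoch {epoch}")
            break
```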

There are many good articles, resources and courses on the fundamentals of ML and DL (e.g. ML: 0,1,2,3,4,5,6 — DL: 0,1, 2, 3, 4 — DL4CV: 0,1).

I hope this tutorial helped you :)

Acknowledgment and references for this tutorial:

Paper + Benchmark + Model

Thanks to Chien-Yao Wang, Alexey Bochkovskiy and Hong-Yuan Mark Liao for their research!

Last I would like to say special thanks to the following people that made this possible in the first place:

  • Alexey Bochkovskiy, one of the developers of this model and of the predecessor YOLOv4 (thank you very much!)
  • Diganta Misra (Mish activation function) and JunnYu (PyTorch-CUDA implementation of Mish)
  • Jonathan Hui for all his amazing blog articles about all types of DL-based models, and the AI Guy (custom YOLO videos and dataset downloader functions)
  • Last but not least, I would like to say special thanks to Joseph Redmon, the inventor of the YOLO model architecture (YOLOv1-v3)



Christoph Haring

Creator of www.FREEWORK.earth (comfortable digital work from any location on the planet)