Player and Ball Detection using Yolov8 + BotSORT tracking on a custom Dataset

Nikhil Chapre
7 min read · Dec 16, 2023


This article is part two of a three-part blog series about a project I built recently while learning Computer Vision: a complete Football Analytics model using Yolov8 + BotSORT tracking.

Read the previous blog here: https://medium.com/@nikhilc2209/an-image-annotation-guide-using-roboflow-for-object-detection-a4e30581b5cf

Objective: Understand the YOLO architecture, train it on a custom dataset, fine-tune the model for better results, and run inference to see what works best.

What does YOLO stand for?

YOLO (You Only Look Once) is a state-of-the-art object detection algorithm that rose to fame thanks to its single-pass detection technique, which gives it an edge over its peers in both speed and accuracy.

YOLOv1 was originally proposed in 2015, treating object detection as a regression problem that predicts bounding boxes and class probabilities directly. It has since undergone many improvements and is currently maintained by Ultralytics, who have released the latest version, Yolov8.

Brief look at how YOLO algorithm works

As the name suggests, the YOLO algorithm makes predictions on an image in a single pass. This is better than traditional methods, which slide windows convolutionally over the whole image or evaluate region proposals at multiple locations to localize objects.

The way YOLO does this is by dividing the image into an S x S grid (shown below), where each grid cell is responsible for producing a bounding box and confidence score as output.

YOLO divides the input image into an S x S grid

For each grid cell in this image we compute the following:

Format of our target variable for each grid cell

The first element is the confidence value, a label that indicates whether any object lies inside the grid cell (0 or 1). If it does, we go on to predict the bounding box in xywh format, where x and y are the coordinates of the box centre and w and h are its width and height. Lastly, we have the class probability distribution vector, which contains a prediction score between 0 and 1 for each object label.

Example output of grid cells using the above image

If we look at the image above, we can clearly see that the blue bounding box marks the true boundary of the dog. The green grid cell contains the centre of that blue box, so its output vector is the one responsible for predicting it.

First we decide whether there is an object in that grid cell. Since the answer is yes, we continue and assign the xywh values. You may have noticed that the width and height values exceed the 0 to 1 range; this is because the true bounding box spans more than the green grid cell, taking a little more than 3 grid cells in both height and width. Lastly, for the class probability scores, the green grid cell only contains the dog object, so we can assign a score of 1 to the dog class and 0 to the car class.

Also, if we take a look at the yellow grid cell, we know it does not contain any object, so we can simply assign a confidence value of 0 to its output vector. The “x” denotes a “don’t care” term, meaning we can safely ignore all the other values in that output vector.
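To make this concrete, here is a purely illustrative sketch of the two output vectors described above; the numbers are invented for the example, and the class order [dog, car] applies only to this image:

# Purely illustrative target vectors for the two grid cells discussed above
# (format: [confidence, x, y, w, h, p_dog, p_car]; all numbers are made up)

# Green cell: the dog's box centre falls inside it, so confidence = 1.
# x, y are the centre's offsets within the cell (0 to 1), while w and h are
# measured in grid cells and can exceed 1 when the box spans several cells.
green_cell = [1, 0.45, 0.55, 3.2, 3.1, 1, 0]

# Yellow cell: no object lies inside it, so confidence = 0 and the remaining
# entries are "don't care" values that the loss function simply ignores.
yellow_cell = [0, None, None, None, None, None, None]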

Training Yolov8 on our custom dataset

Now, let’s continue with our Player and Ball Detection dataset from Roboflow and train Yolov8 on it:

Dataset used: https://universe.roboflow.com/nikhil-chapre-xgndf/detect-players-dgxz0

First, we need to install the Ultralytics package, which maintains all the YOLO models:

pip install ultralytics
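Optionally, we can run a quick sanity check to confirm the install and see which device (CPU/GPU) is available; ultralytics.checks() is a small utility the package ships for this:

import ultralytics

# Prints the Ultralytics version, Python/torch versions and the detected device
ultralytics.checks()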

Next, we need to set up a YAML file (config.yaml) to configure some training parameters:


# config.yaml
path: /path/to/dataset   # absolute path to the dataset root
train: train             # training images, relative to 'path'
val: val                 # validation images, relative to 'path'
test: test               # test images, relative to 'path'

# Define Classes and their Labels
names:
  0: Ball
  1: Player
  2: Referee

Next we need to select a Yolov8 model weight to start our training with:

Different versions of the Yolo model with different parameters and use-cases

For our use case we’ll use Yolov8n (Nano), the lightest and fastest model. It isn’t the most accurate according to mAP, but with enough training it can yield good results while delivering better FPS for video tracking.

from ultralytics import YOLO
import torch
import os

# Load the YOLOv8 model
model = YOLO('yolov8n.pt')

# TRAINING
if __name__ == '__main__':
    results = model.train(data="config.yaml", epochs=50, patience=5)

As shown above, we simply load the data from the config.yaml file we set up earlier. Here we train for 50 epochs with a patience of 5 epochs, meaning that if no improvement is seen over 5 consecutive epochs, training stops early.

Upscaling Network Dimensions for better results

The biggest challenge I faced during training was a poor mAP score on the ‘Ball’ class, and it took me a while to realise what was going wrong. Yolov8 generally expects the input image in a square format; for non-square images it defaults to a width of 640px (with the corresponding height to maintain the aspect ratio) unless an image size is specified, as shown below.

Original Image with 1920x1080 size
Yolov8 resized image to 384x640 size to maintain aspect ratio

Using GIMP to compare size of “Ball” Class

Ball size in pixels in the Original image
Ball size in pixels in the Compressed image

The drop in quality and size of the ball is clearly visible when comparing the two images, which leads to poor detection by the model. Increasing the image size during training results in a much better mAP score, not just for the “Ball” class but for all the other classes as well.

But does that mean we should always use the highest-resolution images for training and inference to get the best results? It depends: increasing the network dimensions makes the model use more training resources and makes it slower, so we need to find a sweet spot that balances the speed and accuracy of our model.

Also, keep in mind that the network dimensions can only be a multiple of 32 according to the YOLO documentation. After some experimenting I decided to use 1088 as the image size, keeping in mind that the smallest object should still end up larger than roughly 15x15 pixels.
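With that in mind, the earlier training call only needs one extra argument. Here is a sketch of the retraining run with the larger input size; imgsz is the Ultralytics argument that controls the network input dimension, and the other values mirror the run above:

from ultralytics import YOLO

model = YOLO('yolov8n.pt')

if __name__ == '__main__':
    # imgsz=1088 keeps small objects like the ball at a usable resolution,
    # at the cost of more GPU memory and slower training/inference.
    results = model.train(data="config.yaml", epochs=50, patience=5, imgsz=1088)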

Model Performance

Condensed view of all metrics involved

Once training finishes, we can review our training/validation results using the metrics shown above. Yolov8 prepares a directory full of graphs and visualizations for each metric, along with the model weights; shown above is just a brief summary.
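If you want these numbers programmatically rather than from the saved plots, a minimal sketch is to re-load the best checkpoint and call val(); the run directory name depends on your own training runs:

from ultralytics import YOLO

# Load the best checkpoint saved during training (path depends on your run)
model = YOLO('runs/detect/train2/weights/best.pt')

# Evaluate on the 'val' split defined in config.yaml
metrics = model.val(data="config.yaml")

print(metrics.box.map50)   # mAP@0.5 averaged over all classes
print(metrics.box.maps)    # per-class mAP@0.5:0.95, in class-index order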

We can now take this training results directory and upload the weights back to Roboflow to deploy as a model. This can be used to assist image labeling, or simply be deployed online for public use.

Metrics view on Roboflow
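For the upload itself, Roboflow’s Python SDK exposes a deploy method on a dataset version. The sketch below uses the workspace and project IDs from the dataset URL above, while the API key and version number are placeholders you would replace with your own (check the current Roboflow docs for the exact call):

from roboflow import Roboflow

rf = Roboflow(api_key="YOUR_API_KEY")  # placeholder; use your own key
project = rf.workspace("nikhil-chapre-xgndf").project("detect-players-dgxz0")

# Attach our trained YOLOv8 weights to a dataset version (version 1 is a placeholder)
project.version(1).deploy(model_type="yolov8", model_path="runs/detect/train2/")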

Running Inference using our model weights

Now, instead of the default pre-trained weights, we can load the best weights we just trained and use them to track video clips with the BoT-SORT tracker available in Ultralytics, using the script below.

import cv2
from ultralytics import YOLO

# Load the YOLOv8 model
# model = YOLO('yolov8n.pt') ### Pre-trained weights

model = YOLO('runs/detect/train2/weights/best.pt') ### weights from trained model

# Open the video file
video_path = r"path/to/video"
cap = cv2.VideoCapture(video_path)

# Loop through the video frames
while cap.isOpened():
    # Read a frame from the video
    success, frame = cap.read()

    if success:
        # Run YOLOv8 tracking on the frame, persisting tracks between frames
        results = model.track(frame, persist=True, show=True, tracker="botsort.yaml")

        # Visualize the results on the frame
        annotated_frame = results[0].plot()

        # Display the annotated frame
        cv2.imshow("YOLOv8 Tracking", annotated_frame)

        # Break the loop if 'q' is pressed
        if cv2.waitKey(1) & 0xFF == ord("q"):
            break
    else:
        # Break the loop if the end of the video is reached
        break

# Release the video capture object and close the display window
cap.release()
cv2.destroyAllWindows()

Adding tracking to our detection model helps us follow objects across consecutive frames in a video clip by assigning a unique ID to each detected object. This also makes it possible to map the trajectory of an object, such as the football, over time and draw paths based on its movement across frames.
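As a rough illustration of that idea, the snippet below (which would sit inside the frame loop above, right after the model.track call) collects the centre point of every tracked box per ID, giving per-object trajectories you could later draw onto the frames:

from collections import defaultdict

track_history = defaultdict(list)   # track_id -> list of (x, y) centre points

# --- inside the while-loop, after results = model.track(...) ---
boxes = results[0].boxes
if boxes.id is not None:            # ids are None when nothing is tracked yet
    for xywh, track_id in zip(boxes.xywh.cpu(), boxes.id.int().cpu().tolist()):
        x, y, w, h = xywh.tolist()
        track_history[track_id].append((x, y))   # centre of this detection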

Final Results

Next up, we’ll discuss how to separate the detected players into their respective teams using HSV filtering.
