Human Detection using YoloV5 @35 FPS on Jetson Nano

Sanjoy
5 min read · Jan 2, 2024


One of my hobby projects as part of home automation was to leverage the security camera feeds to detect human intrusion, and thus be able to filter out motion detection caused only by variations of light, shade, or the movement of leaves, etc.

Data Collection

The first thing to get done is to collect data / videos from the camera(s) that you want to detect humans from. Capturing data in various light conditions and from various angles helps make the detection much more accurate. My recommendation is to capture about 200 to 300 frames from videos or stills, with a mix of frames containing one human, multiple humans, and no humans. This is a good starting point: later, when you see incorrect detections from the trained model, you can upload additional images (especially the ones not detected correctly) and augment the dataset for re-training.

For the training, I used Roboflow to upload the images, tag them, and create a dataset. If there is interest in how to do this, let me know and I will write up an article on it. My dataset has just two classes:

  1. Human
  2. None

Once labelled, I created a dataset with the following pre-processing & augmentations

  1. Resize to 320x320
  2. Grayscale 25% of images
  3. Use the default train-test-validation split

Training

Assembling the Dataset

As you probably know, training is the process of teaching a deep neural network its task through feedback on labelled examples. In this case we use PyTorch as the library for defining the network model. YoloV5 is a network architecture adept at multi-class detection, which we will train on the specific task of human detection. Start with the following Colab notebook.

As an alternative to using the Roboflow API in the above Colab notebook: when I tried the Roboflow Python API directly it was not intuitive for me, so I used the following steps instead.

  1. Sign-in to http://www.roboflow.com
  2. Go to the Project -> Dataset and click the export button at the top right, which will pop up the following dialog

3. Click Continue and choose the Terminal tab

4. Head back to Step 2 of the Colab notebook and scroll down to the following piece of code

5. We would now change this to the following

6. The above step downloads the dataset to Colab and unzips it. Now navigate to the /content/dataset directory, open data.yaml, and replace the relative paths with full paths as below
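For reference, a data.yaml edited this way would look roughly like the sketch below. The exact folder names depend on how Roboflow exported the split, so treat the paths here as assumptions to adapt:

train: /content/dataset/train/images
val: /content/dataset/valid/images
test: /content/dataset/test/images

nc: 2
names: ['Human', 'None']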

Proceeding with Training

In Step 3, change the configuration (“weights” argument) in the training Python call to yolov5n.pt. With a good dataset the loss in accuracy is negligible, and this is a necessary step for the last part, where we convert the model to a TensorRT engine for faster inference.
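For context, the training cell in the notebook is a call to train.py; with the nano weights it looks roughly like the sketch below. The batch size and epoch count are placeholders to tune for your dataset:

!python train.py --img 320 --batch 16 --epochs 100 --data /content/dataset/data.yaml --weights yolov5n.pt --cache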

Run the rest of the Colab to train your model. The trained model is saved under the runs/train/exp directory, where you will find a file called best.pt in the weights subfolder; this is the trained model.

It is a good idea to save a copy of this Colab notebook, as it will be helpful for re-training with more images to make the model iteratively more accurate.

Inference

Iteration 1

The starting point is the Ultralytics YoloV5 inference Python code (https://github.com/ultralytics/yolov5). Download the trained model and run detect.py with the “weights” argument set to your model to run detection.
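As a minimal sketch (the weights path and source are placeholders for your own model and camera feed):

python3 detect.py --weights best.pt --imgsz 320 --source rtsp://<camera-url> --conf-thres 0.5

The --source argument also accepts a video file or a webcam index, which is handy for testing against the clips collected earlier.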

Iteration 2

The network camera will often report motion from frame to frame; by using contours and contour area to do a cheap motion check, we can cut the inference time spent per video (not per frame!) and achieve a higher effective FPS.

You can use the following code to do rudimentary motion detection and pass a frame to the model for detection only if there is motion. This saves a lot of CPU/GPU cycles when monitoring sparsely populated feeds.

# excerpted from a camera-monitoring class; requires opencv-python and imutils
import cv2
import imutils

def detect_motion(self, curr_frame):
    # convert to grayscale and blur to suppress sensor noise
    gray = cv2.cvtColor(curr_frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.GaussianBlur(gray, (21, 21), 0)

    # pixel-wise difference against the reference frame
    frame_delta = cv2.absdiff(self.first_frame, gray)

    # threshold the delta and dilate to fill in holes
    thresh = cv2.threshold(frame_delta, 25, 255, cv2.THRESH_BINARY)[1]
    thresh = cv2.dilate(thresh, None, iterations=2)

    # find contours of the changed regions
    cnts = cv2.findContours(thresh.copy(), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    cnts = imutils.grab_contours(cnts)

    # loop over the contours
    motion_detected = False
    for c in cnts:
        # if the contour is too small, ignore it
        if cv2.contourArea(c) < 200:
            continue
        motion_detected = True
        break
    return motion_detected
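One note on this method: self.first_frame is assumed to be a reference frame captured earlier and pre-processed the same way, along the lines of the sketch below. You may also want to refresh it periodically so that gradual lighting changes are not reported as motion.

# assumed one-time initialization elsewhere in the class (hypothetical)
gray = cv2.cvtColor(first_frame, cv2.COLOR_BGR2GRAY)
self.first_frame = cv2.GaussianBlur(gray, (21, 21), 0)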

Iteration 3

Having done what optimization we could at the code level, the next level of optimization I found was moving the runtime engine to ONNX. To convert to ONNX, run the following locally:

python3 export.py --weights <torch .pt model name> --imgsz 320 --include onnx --optimize --simplify --opset 12

Using the ONNX Runtime for the model speeds up inference. Make sure you have onnxruntime-gpu installed so that the GPU is used for inference.
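A quick way to confirm that the GPU provider is actually picked up is the sketch below (best.onnx is the assumed name of the exported model):

import onnxruntime as ort

# CUDAExecutionProvider should appear here if onnxruntime-gpu is installed correctly
print(ort.get_available_providers())

session = ort.InferenceSession("best.onnx", providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
print(session.get_providers())  # the session should list CUDAExecutionProvider first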

Iteration 4

The final optimization in this blog is to convert the ONNX model to a TensorRT engine. The TensorRT runtime is already installed on the Jetson Nano. My setup uses a 2GB Nano, which had trouble converting YoloV5 models other than the nano (n) model due to memory limitations. To convert to the .engine format, use the following:

/usr/src/tensorrt/bin/trtexec --onnx=<onnx input> --saveEngine=<output engine file> --verbose

Running detect.py with the engine file as the weights now gives about 30 to 35 FPS.
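For reference, this is the same detect.py call with the engine file passed as the weights (paths are placeholders); the image size must match the size used when exporting:

python3 detect.py --weights best.engine --imgsz 320 --source rtsp://<camera-url>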

Coming up: getting to 35+ FPS by converting to C++.
